Abstract: Which communication rates can be attained over
a channel whose output is an unknown (possibly stochastic) 
function of the input that may vary arbitrarily in time with no 
a-priori model? Following the spirit of the finite-state compress- 
ibility of a sequence defined by Lempel and Ziv, a "capacity" 
is defined for such a channel as the highest rate achievable by 
a designer knowing the particular relation that indeed exists 
between the input and output for all times, yet is constrained 
to use a fixed finite-length block communication scheme (i.e., 
use the same scheme over each block). In the case of the 
binary modulo additive channel, where the output sequence is 
obtained by modulo addition of an unknown individual sequence 
to the input sequence, this capacity is upper bounded by $1 - \rho$,
where $\rho$ is the finite state compressibility of the noise sequence.
A communication scheme with feedback that attains this rate 
universally without prior knowledge of the noise sequence is 
presented. 

I. Introduction 

Consider the problem of communicating over a channel, 
where the (possibly stochastic) relation between the input and 
output is unknown to the transmitter and the receiver and may 
be, in general, non stationary. In particular, no assumption 
is made that the channel behavior up to a certain point in 
time indicates anything about its expected behavior from this 
time on. The key characteristic of such a channel is that the 
channel law cannot be learned, i.e. it is impossible, using 
an asymptotically short measurement period, to obtain the 
channel probability law and use it during the rest of the 
transmission. 

Clearly, communication over such an arbitrary channel is 
challenging. Furthermore, even the question what the limits of 
such communication are, is not well posed. To emphasize the 
fact that the relation between input and output is a function of 
the entire sequences (or vectors) this channel shall be termed 
a vector channel. A simple example of such a channel, which 
was discussed by Shayevitz and Feder [1], is the
modulo-additive channel with an individual noise sequence, defined
by the relation $\mathbf{y} = \mathbf{x} + \mathbf{z}$, where
$\mathbf{x}, \mathbf{y}, \mathbf{z} \in \mathcal{X}^n$ are n-length
vectors denoting the input, output and noise sequence,
$\mathcal{X}$ is a finite alphabet, '+' denotes modulo addition over
$\mathcal{X}$, and the sequence $\mathbf{z}$ is arbitrary and unknown. The main
focus of the current paper is on this channel model. When the
alphabet is $\mathcal{X} = \{0, 1\}$, this channel is referred to as the binary
additive channel.

In the general vector channel, when the conditional probability
of the output vector given the input vector is known, the
classical Shannon capacity, i.e. the maximum communication
rate achievable with an arbitrarily small error probability, is
well defined. The Shannon capacity of the general causal
vector channel was given by Han and Verdú [2]. When the
channel is unknown, the Shannon capacity of the known
channel is in many cases not attainable universally. In this
case, the compound channel or arbitrarily varying channel
(AVC) frameworks [3] may be used. In these frameworks, the
capacity is defined as the maximum rate of transmission which
guarantees robust communication over all possible channels.
However, these frameworks do not consider the ability to use
feedback to adjust the communication parameters, and are
therefore worst-case in nature. On the other hand, Shayevitz
and Feder [1] have shown that for the modulo-additive channel
with an individual noise sequence, by using feedback to adapt
the transmission rate to the actual channel occurrence, these
worst-case assumptions may be alleviated. These results were
extended by us and other authors [4], [5].

Since the channel is unknown, the target is to find a universal
communication system that operates without knowing the
channel. While there are known universal source encoders [6]
and universal predictors [7], in the communication problem
the term "universality" has been used mainly with respect to
decoders (competing against the maximum likelihood decoder
in a compound channel [3], [8]), and there is currently no
notion of universality with respect to the complete
communication system. This is because in the traditional AVC model
feedback is not considered, and therefore the encoder is
assumed to be fixed. On the other hand, in existing works that
consider adaptation of the communication rate using feedback
[1], [4], [5], the communication rates achieved do not have a
strong justification. For example, these works define the rate
using 0-order empirical distributions, and higher rates could be
attained by considering empirical distributions with memory.

Let us denote by $P^\theta_{Y|X}$ the conditional distribution of the
channel output given the input defining a vector channel,
where $\theta$ is an index belonging to a (possibly infinite) index set
$\Theta$. Given a class of vector channels
$\{P^\theta_{Y|X}\}_{\theta \in \Theta}$, the objective
is to assign a rate $C_\theta$ to each channel, such that on one hand
$C_\theta$ has an operational meaning, for example the maximum
rate achievable in a certain situation, and on the other hand,
it would be possible to construct a universal system using
feedback that, without knowledge of $\theta$, attains a rate of at
least $C_\theta$ for all $\theta$. The difference from the AVC or compound
channel models is that the communication rate depends on
$\theta$. As shall be seen, the maximum rate achievable by block
encoders and decoders is a reasonable target that, at least for
the class of modulo-additive channels, is universally attainable.
Note, however, that the system attaining this target rate is not
in itself a block encoding system.

This paper contains two main contributions. The first is a
definition of a target rate $C_\theta = C_{\mathrm{IFB}}(P^\theta_{Y|X})$ for any vector
channel, which is termed the "iterated finite block capacity",
together with some of its important properties. The other
contribution is specific to the modulo-additive channel with an
individual noise sequence, for which a universal system that attains
this target rate without knowing the channel is presented. The
paper is organized as follows: in Section II the motivation for
the definition of $C_{\mathrm{IFB}}$ is explained. Section III is a high level
overview of the results regarding the modulo-additive channel
and the main ideas behind the proofs. Section IV includes the
detailed definitions with some discussion. Section V focuses
on the modulo-additive channel and includes the upper bound
on $C_{\mathrm{IFB}}$ and the universal system achieving it. The redundancy,
i.e. the convergence rate, in achieving the IFB capacity is
explored in Section VI. Section VII is devoted to discussion
and comments and suggests some extensions and alternative
definitions.

II. Motivation 

Let us now discuss the motivations for the definitions of 
a target rate which is universally achievable. The inherent 
difficulty of defining the maximal communication rates over 
arbitrary vector channels can be appreciated by consider- 
ing even the simple example of a binary additive channel 
$\mathbf{y} = \mathbf{x} \oplus \mathbf{z}$ with an individual noise sequence $\mathbf{z}$, where
'$\oplus$' denotes modulo-2 addition. For every specific individual
noise sequence $\mathbf{z}$, the capacity of this channel is 1 bit/use.
On the other hand, if the noise sequence is arbitrary and
unknown, the AVC capacity [3] is zero. It would initially seem
that nothing much can be done when the noise sequence is
unknown; however, it was shown [1] that using feedback and
common randomness, and by adapting the decoding rate, a
communication rate of $R = 1 - \hat{H}(\mathbf{z})$ could be achieved, where
$\hat{H}(\cdot)$ denotes the empirical entropy of the noise sequence (the
binary entropy of the empirical cross-over probability). The
main idea is that if the empirical channel can be measured 
and the communication rate can be adapted, then rather than 
making a-priori pessimistic assumptions, one can opportunis- 
tically increase the rate when the noise sequence has a low 
empirical entropy. 

A disturbing fact is that some arbitrariness exists in deciding
on the rates to achieve per each channel: in the binary additive
channel, given a sequence $\mathbf{s}$ of choice, one could also design
a system that achieves the rate $1 - \hat{H}(\mathbf{z} \oplus \mathbf{s})$, by adding the
sequence $\mathbf{s}$ to the channel output and then applying Shayevitz
and Feder's scheme [1]. Doing so, a rate of 1 is obtained for
the sequence $\mathbf{z} = \mathbf{s}$, where in the original system the rate
was $1 - \hat{H}(\mathbf{s})$, and a rate of $1 - \hat{H}(\mathbf{s})$ for the noiseless case
$\mathbf{z} = \mathbf{0}$, so we may say the noise sequence $\mathbf{s}$ is "favored" over
$\mathbf{0}$. This demonstrates the arbitrariness in determining which
communication rates are possible. To remove this arbitrariness,
a reasonable criterion is sought, to decide which channels
(noise sequences, in the example) to favor over others.
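
The arbitrariness can be checked numerically. The following is a minimal sketch, not part of the scheme of [1]: it only evaluates the two target rates $1 - \hat{H}(\mathbf{z})$ and $1 - \hat{H}(\mathbf{z} \oplus \mathbf{s})$ for a hypothetical favored sequence. The function names and the particular choice of `s` are assumptions made for illustration.

```python
import math

def binary_entropy(p):
    """Binary entropy in bits, with h(0) = h(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def empirical_rate(z):
    """Target rate 1 - H(z), where H is the zero-order empirical entropy of z."""
    p1 = sum(z) / len(z)  # empirical cross-over probability
    return 1.0 - binary_entropy(p1)

def shifted_rate(z, s):
    """Target rate 1 - H(z xor s) of the system that first adds s to the output."""
    return empirical_rate([zi ^ si for zi, si in zip(z, s)])

# Hypothetical favored sequence with half ones, and the noiseless sequence.
s = [0, 1] * 50
zero = [0] * 100

print(empirical_rate(zero), shifted_rate(zero, s))  # noiseless case
print(empirical_rate(s), shifted_rate(s, s))        # noise equal to s
```

With this choice of `s`, the original system favors the all-zero noise sequence while the shifted system favors `s`, and each system gets no rate on the sequence favored by the other.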



This issue bears significant resemblance to issues tackled 
in universal source coding (compression) and in universal 
prediction. In universal compression, one would like to set a 
target for the compression rate of an individual sequence. As 
in our problem, someone who knows the sequence can design 
an encoder which compresses it to 1 bit, whereas assuming 
the sequence is completely unknown and without favoring 
any sequence over another, no compression can be achieved. 
There are many possible fixed to variable encoders which are 
uniquely decodable, and the decision between them may seem 
arbitrary. One solution, proposed by Lempel and Ziv [6], was
to set as a target the compression rates that are achievable by 
machines with limited capabilities, i.e. finite state machines 
(FSM). They defined the notion of finite state compressibility 
for an infinite sequence, as the best compression rate that 
can be achieved by any information lossless FSM operating 
over the (infinite) sequence, and had shown that the LZ78 
compression algorithm based on incremental parsing (defined 
there), achieves this compression rate universally for any 
sequence. This concept supplies a criterion to decide which 
sequences to favor over others, without assuming a probability 
law. A similar notion, i.e. that of comparing against the best 
machine out of a restricted class, is applied in universal 
prediction [7], [9].
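
To make the incremental parsing idea concrete, the following sketch parses a sequence into the shortest phrases not seen before and estimates a code length from the phrase count. It is an illustration under simplifying assumptions, not the exact encoder of [6]: the per-phrase cost (a pointer to an earlier phrase plus one new symbol, each rounded up to whole bits) and the function names are choices made here.

```python
import math

def lz78_phrases(seq):
    """Incremental parsing: split seq into the shortest phrases not seen before."""
    seen = set()
    phrases = []
    current = ()
    for symbol in seq:
        current += (symbol,)
        if current not in seen:
            seen.add(current)
            phrases.append(current)
            current = ()
    if current:  # trailing phrase that repeats an earlier one
        phrases.append(current)
    return phrases

def lz78_length_bits(seq, alphabet_size=2):
    """Simplified code length: pointer to one of j candidates plus one symbol."""
    symbol_bits = math.ceil(math.log2(alphabet_size))
    bits = 0
    for j in range(1, len(lz78_phrases(seq)) + 1):
        bits += math.ceil(math.log2(j)) + symbol_bits
    return bits

print(["".join(p) for p in lz78_phrases("1011010100010")])
print(lz78_length_bits("0" * 1000))
```

A highly repetitive sequence is parsed into few, long phrases, so its estimated length is far below one bit per symbol, whereas a typical random sequence yields many short phrases.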

Following this lead, the comparison class is chosen to be
the set of fixed finite-length block encoders and decoders,
which repeatedly perform the same encoding and decoding
operations over blocks of any fixed length (Figure 2). This
class is a relatively simple one, while still yielding a reasonable
criterion to set the communication rate. The iterated finite
block capacity of an infinite vector channel, $C_{\mathrm{IFB}}$, is defined
as the supremum of all rates which are reliably achievable by
encoders and decoders in the comparison class. This capacity
value is smaller, in general, than the Shannon capacity of the
vector channel. This definition has operational significance,
since many practical communication systems use block
encoding, and therefore universally attaining $C_{\mathrm{IFB}}$ means that
one can design a system which, without any prior knowledge
of the channel, is essentially at least as good as any system
using block coding of any finite length. The universal system
itself does not belong to the comparison class: it does not
operate in fixed blocks, it modifies its behavior based on
the past, and it uses feedback. There are various ways to
extend the comparison class, which are briefly discussed in
Section VII; however, here the focus is on this basic model.
Although achieving $C_{\mathrm{IFB}}$ universally is possible for classes of
vector channels wider than the modulo-additive channel [10],
it is not possible to attain this rate for general unknown vector
channels.

III. Overview of the main results 

This section provides an informal review of the main results 
and rough proof outlines. The purpose is to provide an under- 
standing of the results without diving into mathematical detail. 
The main results of this paper pertain to the modulo-additive
channel with individual noise sequence. For this channel, it is
shown in Section V that
$C_{\mathrm{IFB}} \le (1 - \rho(\mathbf{z})) \cdot \log|\mathcal{X}|$, where
$\rho(\mathbf{z})$ is the finite state compressibility of $\mathbf{z}$ (as defined by
Lempel and Ziv [6]). Assuming that common randomness
exists and that there is a feedback link, a universal system
employing feedback exists, which asymptotically attains this
rate without prior knowledge of the noise sequence.
In Section VI upper and lower bounds on the convergence rate
are derived. Below the main ideas in the proofs are described.

Fig. 1. Illustration of the decoding rule of the rate adaptive system. $L(\mathbf{z}^i)$
is the compression length. Decoding thresholds with respect to
$N_i = i \cdot \log|\mathcal{X}| - L(\mathbf{z}^i)$ are depicted by horizontal lines.

Let us begin with the upper bound on $C_{\mathrm{IFB}}$. Suppose a
given encoder and decoder (the reference system) achieves
rate $R$ over $b$ blocks of size $k$ (Figure 2). During these $b$
blocks, the reference system "sees" $b$ different noise vectors
of length $k$, namely $\mathbf{z}_{(i-1)k+1}^{ik}$, $i = 1, \ldots, b$. Since the system
is fixed during these $b$ blocks, this is equivalent to operating
over a stochastic channel, where the noise vector $\mathbf{Z}$ is chosen
uniformly from the set of these vectors, with probability $\frac{1}{b}$
for each. This random vector is termed the "collapsed" noise
sequence, and the channel generated from it the "collapsed"
channel. The standard converse of the channel capacity
theorem (without the assumption of a memoryless channel) can
be applied to the collapsed channel (Figure 5), and yields an
upper bound on $C_{\mathrm{IFB}}$ which is roughly
$\log|\mathcal{X}| - \frac{1}{k} H(\mathbf{Z})$.
The entropy $H(\mathbf{Z})$ is lower bounded using the finite state
compressibility of the sequence, since a finite state machine
may achieve a compression rate close to the entropy by
standard block-to-variable coding, where the code lengths are
tuned to the statistics of the collapsed noise vector. Combining
these bounds yields the result
$C_{\mathrm{IFB}} \le (1 - \rho(\mathbf{z})) \cdot \log|\mathcal{X}|$
(Theorem 1).
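
The collapsed-channel quantity is easy to evaluate for a given noise sequence. The sketch below computes the entropy of the empirical distribution of the first $b$ blocks of length $k$, and the rough bound $\log|\mathcal{X}| - \frac{1}{k} H(\mathbf{Z})$; it ignores the overheads treated in the full proof, and the function names are assumptions for illustration.

```python
import math
from collections import Counter

def collapsed_entropy(z, k, b):
    """Entropy in bits of the collapsed noise vector: the empirical
    distribution of the first b blocks of length k in z."""
    blocks = [tuple(z[i * k:(i + 1) * k]) for i in range(b)]
    counts = Counter(blocks)
    return -sum((c / b) * math.log2(c / b) for c in counts.values())

def collapsed_rate_bound(z, k, b, alphabet_size=2):
    """Rough bound log|X| - H(Z)/k for a fixed scheme with block length k."""
    return math.log2(alphabet_size) - collapsed_entropy(z, k, b) / k

z = [0, 1] * 8  # periodic noise sequence of length 16
print(collapsed_rate_bound(z, 1, 16))  # symbol-wise view: noise looks random
print(collapsed_rate_bound(z, 2, 8))   # blocks of 2: noise is fully predictable
```

For this periodic sequence the bound depends strongly on the block length: with $k = 1$ the collapsed noise is a fair coin, while with $k = 2$ every block sees the same noise vector.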

Next, a communication scheme is demonstrated that
asymptotically attains the rate $\log|\mathcal{X}| - \frac{1}{n} L(\mathbf{z})$, where $L(\mathbf{z})$
is the compression length of the sequence $\mathbf{z}$ by a given
sequential source encoder, and $n$ is the overall block length.
The scheme is based on iterative application of rateless coding,
sending $K$ bits in each block. Each codeword in the codebook
of $\exp(K)$ words is chosen independently and distributed
uniformly over $\mathcal{X}^n$. The transmitter sends symbols from the
codeword matching the $K$ transmitted bits, until a termination
condition occurs on the receiver side. Then, the receiver
indicates the end of the block through the feedback link
and a new block begins. The termination condition is based
on feeding into the source encoder the sequence of noise
which is known with high probability from previous blocks
that have been decoded (since both the channel input and the
channel output are known), and then, for each of the $\exp(K)$
hypotheses regarding the codeword sent in the current block,
continuing this sequence with the hypothetical noise sequence
(formed by the known output and hypothetical input), to form
a hypothesis for the noise sequence from the beginning of
transmission to the end of the current block, $\hat{\mathbf{z}}_1^i$. For each
hypothesis, the decoder counts the number of bits that reflect
the compression of the noise sequence in the current block,
and terminates the block if for any codeword this length is
smaller than a threshold.

The proof of this scheme's performance is roughly as
follows. Due to the random coding, most of the hypotheses
(except the true one) yield random noise sequences.
These sequences are incompressible, and therefore the number
of bits representing the last block would be approximately
$\log|\mathcal{X}|$ times the number of symbols in the block. It can
be shown that setting the threshold approximately $K$ below
this value guarantees a small probability of crossing the
threshold for any of the $\exp(K)$ incorrect codewords,
and therefore a small probability of error. It is convenient to
define the "incompressibility" of the sequence up to time $i$
as $N_i = i \cdot \log|\mathcal{X}| - L(\hat{\mathbf{z}}_1^i)$, representing the gap between
the compressibility of the hypothetical noise sequence and
the compressibility of a random sequence. The approximate
termination condition may be interpreted as decoding when the
value of $N_i$ increases by $K$ from the start of the current block.
Since when this occurs the system starts a new block, there is
a correspondence between the increase in $N_i$ and the number
of blocks and bits that are transmitted, i.e. the termination
condition can be approximately interpreted as $N_i > K(b+1)$,
where $b$ is the number of blocks so far. Therefore, assuming
that by time $n$, $B$ blocks were transmitted, the number of transmitted
bits is $K \cdot B \approx N_n = n \cdot \log|\mathcal{X}| - L(\hat{\mathbf{z}}^n)$. Assuming no errors
occurred, $\hat{\mathbf{z}}^n = \mathbf{z}^n$, and dividing by $n$ the desired result is
obtained. This is depicted in Figure 1, where the horizontal
axis is the time $i$. The solid line presents $L(\mathbf{z}^i)$, and the
dashed line $N_i$. The decoding thresholds $K \cdot b$ ($b = 1, 2, \ldots$)
are depicted as horizontal lines, while the vertical lines depict
the decoding times. A decoding occurs whenever $N_i$ crosses
a threshold. A random hypothesized sequence in the current
block implies that $N_i$ does not increase on average. It can be
seen that the number of bits that will be sent is approximately
$N_n$. In the full proof, various overheads that were neglected
above are accounted for.
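
The threshold interpretation of the termination rule can be traced with a few lines of code. This is a simplified sketch under stated assumptions: decoding is error free (so the hypothesized noise equals the true noise), the cumulative compression lengths $L(\mathbf{z}^i)$ are supplied by the caller, thresholds are the absolute values $K \cdot b$, and all overheads are ignored. The function name is hypothetical.

```python
def decoding_times(lengths, K, log_alphabet=1.0):
    """Return the times i at which N_i = i * log|X| - L(z^i) reaches the
    next threshold K * (blocks decoded so far + 1).

    lengths[i - 1] is the compression length L(z^i) in bits.
    """
    times = []
    for i, L_i in enumerate(lengths, start=1):
        N_i = i * log_alphabet - L_i
        if N_i >= K * (len(times) + 1):
            times.append(i)  # block ends here; K more bits are delivered
    return times

# Idealized extremes for a binary channel over n = 20 uses with K = 4:
perfectly_compressible = [0] * 20        # L(z^i) = 0 for all i
incompressible = list(range(1, 21))      # L(z^i) = i for all i
print(decoding_times(perfectly_compressible, K=4))
print(decoding_times(incompressible, K=4))
```

In the first case a block ends every $K$ channel uses, so $K \cdot B = N_n = n$ bits are delivered; in the second $N_i$ never grows and no block is ever decoded.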

To obtain the universal system attaining $C_{\mathrm{IFB}}$ (Theorem 2),
the scheme above is applied with the encoding lengths $L(\mathbf{z})$
determined by the LZ78 source encoder, whose compression
ratios asymptotically approach the finite state compressibility:
asymptotically $L(\mathbf{z}) \le n \rho(\mathbf{z}) \log|\mathcal{X}|$, therefore
$\log|\mathcal{X}| - \frac{1}{n} L(\mathbf{z}) \ge (1 - \rho(\mathbf{z})) \log|\mathcal{X}|$
(where all inequalities are up to asymptotically vanishing factors).



Section VI deals with the question of redundancy, or how
fast the system converges to the rate attained by the IFB system
with a given block length $k$. Unfortunately, it is shown that $n$
must grow at least as fast as $|\mathcal{X}|^k$, approximately. The upper
bound on redundancy is obtained by using a similar universal
system employing a slightly more refined design: instead of
using the Lempel-Ziv compression algorithm to generate decoding
metrics, a universal probability assignment based on a mixture
of Krichevsky-Trofimov distributions [11] is used. The lower
bound is obtained by presenting a design of an IFB system
together with a random channel (i.e. a distribution over noise
sequences), such that the mutual information over the channel
is smaller than the rate obtained by the IFB system. This is
possible because the IFB system is designed together with
the channel and can use the knowledge of the specific noise
sequence. On the other hand, the rate obtained by any universal
system with feedback is bounded by the mutual information,
and this gap comprises the lower bound on redundancy.
Although the upper bound and lower bound on redundancy
differ, they agree in terms of the asymptotic growth rate of
$n$ as a function of $k$.
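
For orientation, the basic building block of such an assignment is the zero-order Krichevsky-Trofimov sequential estimator, which assigns the next symbol $a$ the probability $(n_a + \frac{1}{2}) / (n + |\mathcal{X}|/2)$ after $n$ symbols with $n_a$ occurrences of $a$. The sketch below computes only this zero-order ideal code length; the mixture actually used in Section VI is more elaborate and is not reproduced here. The function name is an assumption.

```python
import math

def kt_code_length(seq, alphabet_size=2):
    """Ideal code length -log2 P_KT(seq) of the zero-order KT estimator.

    Symbols in seq are integers in range(alphabet_size).
    """
    counts = [0] * alphabet_size
    bits = 0.0
    for n, a in enumerate(seq):
        p = (counts[a] + 0.5) / (n + alphabet_size / 2)
        bits -= math.log2(p)
        counts[a] += 1
    return bits

print(kt_code_length([0] * 1000))     # constant sequence: only a few bits
print(kt_code_length([0, 1] * 500))   # balanced sequence: about one bit/symbol
```

The code length of a constant sequence grows only logarithmically with its length, which is what makes such assignments useful as decoding metrics for compressible noise.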

IV. Channel model and definitions 

In this section we begin the formal presentation of the
results, by presenting the channel model and the definitions
of the capacity $C_{\mathrm{IFB}}$, and discussing their implications.

A. Notation 

Vectors are denoted by boldface letters. Sub-vectors are
defined by superscripts and subscripts:
$\mathbf{x}_j^i = [x_j, x_{j+1}, \ldots, x_i]$.
$\mathbf{x}_j^i$ equals the empty string if $i < j$. The subscript is sometimes
removed when it equals 1, i.e. $\mathbf{x}^i = \mathbf{x}_1^i$.

For a vector or random variable $\mathbf{X}$,
$\mathbf{X}_i^{[k]} = \mathbf{X}_{(i-1)k+1}^{ik}$
denotes the i-th block of length $k$ in the vector. For brevity,
vectors with similar ranges are sometimes joined together,
for example, the notation $(\mathbf{X}\mathbf{Y})_i^j$ is used instead of
$\mathbf{X}_i^j \mathbf{Y}_i^j$.
Exponents and logs are base 2. Random variables are
distinguished from their sample values by capital letters. We
use the following notation for empirical distributions: for a
list or vector $A = (x_1, x_2, x_3, \ldots)$, $\hat{P}(A = x)$ denotes the
relative number of occurrences of $x$ within $A$. For example,
$\hat{P}(\mathbf{z} = 1) = \hat{P}((z_i)_{i=1}^{n} = 1)$ denotes the normalized number
of '1'-s in $\mathbf{z}$.

B. Channel model 

Let $\mathbf{x}$ and $\mathbf{y}$ be infinite sequences denoting the input and
the output respectively, where each letter is chosen from the
alphabets $\mathcal{X}, \mathcal{Y}$ respectively, $x_i \in \mathcal{X}, y_i \in \mathcal{Y}$. Throughout the
current paper the input and output alphabets are assumed to
be finite. A channel $P_{Y|X}$ is defined through the probabilistic
relations
$P_{Y|X}(\mathbf{y}^n | \mathbf{x}^\infty) = \Pr(\mathbf{Y}^n = \mathbf{y}^n | \mathbf{X}^\infty = \mathbf{x}^\infty)$ for
$n = 1, 2, \ldots, \infty$. A finite length output sequence is considered
in order to make the probability well defined. Sometimes, this
probability will be informally referred to as
$\Pr(\mathbf{Y}_1^\infty | \mathbf{X}_1^\infty)$,
and should be understood as the sequence of the above
distributions, or their limit for $n \to \infty$.



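Definition 1. The channel defined by $\Pr(\mathbf{Y}_1^\infty | \mathbf{X}_1^\infty)$ is termed
causal if for all $n$:

$$\Pr(\mathbf{Y}_1^n | \mathbf{X}_1^\infty) = \Pr(\mathbf{Y}_1^n | \mathbf{X}_1^n). \qquad (1)$$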

All the definitions below (including IFB capacity) pertain
to causal channels. This characterization of a causal channel
is similar to the definition used by Han and Verdú [2] (and
references therein). This definition is also limited in assuming
the channel starts from a known state (at time 0). However, this
does not limit the current setting, because an arbitrary initial
state can be modeled by considering the family of channels
with all possible initial states. Note that non-causality that
consists of bounded negative delays can always be compensated
by applying a delay to the output.

C. IFB capacity 

The following definitions lead to the definition of IFB
capacity.

Definition 2 (Reference encoder and decoder). A finite length
encoder $E$ with block length $k$ and rate $R$ is a mapping
$E : \{1, \ldots, M\} \to \mathcal{X}^k$ from a set of $M \ge \exp(kR)$ messages to a
set of input sequences $\mathcal{X}^k$. A respective finite length decoder
$D$ is a mapping $D : \mathcal{Y}^k \to \{1, \ldots, M\}$ from the set of output
sequences to the set of messages.

Definition 3 (IFB error probability). The average error
probability in iterative mapping of the $k$ length encoder $E$ and
decoder $D$ to $b$ blocks over the channel $P_{Y|X}$ is defined as
follows: $b$ messages $m_1, \ldots, m_b$ are chosen as i.i.d.
uniformly distributed random variables
$m_i \sim U\{1, \ldots, M\}$, $i = 1, \ldots, b$. The channel input is set to
$\mathbf{x}_i^{[k]} = E(m_i)$, $i = 1, \ldots, b$,
and the decoded message is $\hat{m}_i = D(\mathbf{Y}_i^{[k]})$, where $\mathbf{Y}$
is the channel output. The iterative mapping is illustrated in
Fig. 2. The average error probability is
$P_e = \frac{1}{b} \sum_{i=1}^{b} \Pr(\hat{m}_i \ne m_i)$.
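
For the modulo-additive channel with a fixed noise sequence the channel is deterministic, so the average error probability of Definition 3 can be computed by direct enumeration over blocks and messages. The sketch below does this for a hypothetical reference scheme, a length-3 binary repetition code; the function names are assumptions, and messages are indexed from 0 rather than from 1.

```python
def ifb_error_probability(encode, decode, num_messages, z, k, b, q=2):
    """Average error probability of iterating (encode, decode) over the first
    b blocks of length k of the channel y = x + z (mod q), with messages
    uniform over range(num_messages)."""
    errors = 0
    for i in range(b):
        noise = z[i * k:(i + 1) * k]
        for m in range(num_messages):
            x = encode(m)
            y = [(xj + zj) % q for xj, zj in zip(x, noise)]
            if decode(y) != m:
                errors += 1
    return errors / (b * num_messages)

# Hypothetical reference scheme: binary repetition code with k = 3.
def rep3_encode(m):
    return [m] * 3

def rep3_decode(y):
    return 1 if sum(y) >= 2 else 0

# Four blocks of noise; only the third block has more than one flip.
z = [0, 0, 1,  0, 0, 0,  1, 1, 0,  0, 0, 0]
print(ifb_error_probability(rep3_encode, rep3_decode, 2, z, k=3, b=4))
```

Here both messages are decoded incorrectly in the third block only, so the average over the four blocks is one quarter.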

Definition 4 (IFB achievability). A rate $R$ is
iterated-finite-block (IFB) achievable over the channel $P_{Y|X}$ if for any
$\epsilon > 0$ there exist $k, b^* > 0$ such that for any $b > b^*$ there exist an
encoder $E$ and a decoder $D$ with block length $k$ and rate $R$
for which the average error probability in iterative mapping
of $E, D$ to $b$ blocks is at most $\epsilon$.

Note that this is equivalent to stating that the lim sup of the
average error probability with respect to $b$ is at most $\epsilon$.

Definition 5 (IFB capacity). The IFB capacity of the channel
$P_{Y|X}$ is the supremum of the set of IFB achievable rates, and
is denoted $C_{\mathrm{IFB}}$.

D. Competitive Universality 

In the following, the properties of the adaptive system with
feedback, and IFB-universality are defined. A randomized
rate-adaptive transmitter and receiver for block length $n$ with
feedback are defined as follows: the transmitter is presented
with a message expressed by an infinite bit sequence, and
following the reception of $n$ symbols, the decoder announces
the achieved rate $R$, and decodes the first $\lceil nR \rceil$ bits. An error
means any of these bits differs from the bits of the original
message sequence. Both encoder and decoder have access to
a random variable $S$ (the common randomness) distributed
over a chosen alphabet, and a causal feedback link allows the
transmitted symbols to depend on previously sent feedback
from the receiver. See formal definitions in our previous paper
[5]. The system is illustrated in Fig. 3.

Fig. 2. An illustration of iterative mapping used for the definition of average error probability (see Definition 3). The same encoder and decoder are used
over each of the $b = 5$ blocks of $k = 10$ channel uses, and the average error probability is computed.

The following definition states formally the notion of IFB- 
universality for rate adaptive systems: 

Definition 6 (IFB universality). With respect to a set of
channels $\{P^\theta_{Y|X}\}, \theta \in \Theta$ (not necessarily finite or countable), a
rate-adaptive communication system (possibly using feedback
and common randomness) is called IFB universal if for every
channel in the family and any $\epsilon, \delta > 0$ there is $n$ large enough
such that when the system is operated over $n$ channel uses,
then with probability $1 - \epsilon$, the message is correctly decoded and
the rate is at least $C_{\mathrm{IFB}}(P^\theta_{Y|X}) - \delta$.

E. A discussion on IFB capacity and universality 

Following are some comments regarding IFB capacity and
IFB universality. Note that the use of average error probability
over time and messages (expressed in the assumed uniform
distribution) rather than maximum error probability (over
time or messages) reduces the requirements from $E, D$ and
therefore increases $C_{\mathrm{IFB}}$.

As noted, $C_{\mathrm{IFB}} \le C$, where $C$ is the Shannon capacity.
However, for i.i.d. memoryless channels clearly $C_{\mathrm{IFB}} = C$.
The difference between $C$ and $C_{\mathrm{IFB}}$ relates to the stability
of the channel over time, and the ability to utilize channel
structure which cannot be observed in finite time. Let us give
two examples to sharpen this difference:

Example 1. Consider the binary product channel $y_i = x_i \cdot z_i$,
and let the sequence $\mathbf{z}$ alternate between 0 and 1, in blocks
of ever growing size, but such that the overall frequency of
0 is $\frac{1}{2}$, and the length of each block is negligible compared
to the total length of previous blocks. For example, set $z_i$ to
0 in $i \in \bigcup_{k=1}^{\infty} [2k^2, (k+1)^2 + k^2]$. For this channel
$C_{\mathrm{IFB}} = 0$ while $C = \frac{1}{2}$. The reason is that for every finite length
encoder/decoder, ultimately as $b \to \infty$ half the blocks will
fall on bursts of $z = 0$ and be in error. Note that if rate
adaptation had been allowed at the IFB decoder, this
capacity would not have been zero (see Section VII-A).
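
The behavior described in Example 1 can be observed numerically. The sketch below assumes that the zero bursts occupy the index intervals $[2k^2, (k+1)^2 + k^2]$ for $k = 1, 2, \ldots$ (with indices starting at 1) and that $z_i = 1$ elsewhere; it then counts the zeros and the fraction of fixed-length blocks that fall entirely inside a burst, where the product channel output carries no information. The function names are hypothetical.

```python
def example1_noise(n):
    """z_1..z_n with z_i = 0 on [2k^2, (k+1)^2 + k^2], k = 1, 2, ..., else 1."""
    z = [1] * (n + 1)  # z[0] is unused padding so that z[i] holds z_i
    k = 1
    while 2 * k * k <= n:
        last = min((k + 1) ** 2 + k * k, n)
        for i in range(2 * k * k, last + 1):
            z[i] = 0
        k += 1
    return z[1:]

def fraction_erased_blocks(z, block_len):
    """Fraction of consecutive length-block_len blocks on which z is all zero."""
    num_blocks = len(z) // block_len
    erased = sum(
        1 for j in range(num_blocks)
        if not any(z[j * block_len:(j + 1) * block_len])
    )
    return erased / num_blocks

z = example1_noise(19999)
print(z.count(0) / len(z))            # overall frequency of zeros
print(fraction_erased_blocks(z, 10))  # blocks lost by a length-10 block code
```

Both quantities approach one half as $n$ grows, for any fixed block length, since the bursts eventually become much longer than the block.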



Example 2. Consider a channel with ever growing delay:
Suppose that $d_i$ is a sequence of slowly growing delays. For
example, $d_i = \lfloor \log i \rfloor$, and the channel is $y_i = x_{i - d_i}$, where $x, y$
are binary. The capacity of this channel is $C = 1$, whereas
$C_{\mathrm{IFB}} = 0$. Here, the reason for the gap is the inability to
utilize the channel structure with a finite block size.

Following these examples, the choice of $C_{\mathrm{IFB}}$ may be
justified by two main reasons: one is its operational
significance, i.e. that universally attaining $C_{\mathrm{IFB}}$ means competing
with every static block coding system, and the other is the
rejection of eccentric behaviors of the channel, such as the
ones mentioned in the examples above.

Note that although $C_{\mathrm{IFB}} \le C$, the universal system
presented here may opportunistically achieve rates above $C$. This
means the communication rate may exceed $C$ in part of the
time. Consider for example the binary non-ergodic channel
that with probability $p$ has $\mathbf{y} = \mathbf{x}$, and with probability $1 - p$ the
output is independent of the input. Then while the capacity of
this channel is $C = 0$ (and $C_{\mathrm{IFB}} = 0$), by adapting the rate,
one could attain a rate of 1 with probability $p$.

An interesting question is whether for a general vector 
channel, Cifb can be universally attained. Unfortunately the 
answer is negative, and the reason is that since the input 
sequences used by the reference encoder and by the universal 
system are different, infinite memory in the channel may 
cause the channel to get "stuck" in an unfortunate state. This 
phenomenon may be nicknamed a "password" channel, since 
it is similar to a situation where a password is required at 
the beginning of transmission, otherwise the channel becomes 
useless. In this case, a reference system knowing the password 
may succeed and a universal system, having only one attempt 
to find the password, is bound to fail. In other words, given an 
encoder, a channel can be structured such that it will identify 
the specific encoder's codebook, and fail if any deviation from 
this codebook is observed. Here is a simple example: 

Example 3. Consider a family of two binary channels. In the
first channel, if $x_1 = 0$ then the channel becomes clean for
$i \ge 2$, i.e. $\forall i \ge 2 : y_i = x_i$, but if $x_1 = 1$, then it becomes
blocked, i.e. $\forall i \ge 2 : y_i = 0$. The second channel is the same,
except the roles of 0, 1 are reversed. Clearly, for both channels
$C_{\mathrm{IFB}} = 1$, since the only constraint required to avoid blocking
is that the first symbol in each encoded block is constant 0 or
1, and therefore a rate of $\frac{k-1}{k}$ can be obtained with block size
$k$. On the other hand, no universal system can guarantee any
rate with a vanishing error probability, since any choice of the
first symbol will lead to blocking in one of the two channels.

The conclusion from the above is that the concept of iterated 
finite block capacity is not as strong as the concept of finite 
state compressibility, which is truly universally attainable. 
This problem relates to a fundamental difficulty in universal 
communication compared to universal compression: in uni- 
versal compression the sequence is given, one can compare 
different encoders operating on the same sequence, and the 
major difficulty is dealing with the unknown future of the 
sequence. In universal communication there is an additional 
difficulty because the encoder's actions (the input symbols) 
affect the channel behavior in an unexpected way. 

One may be tempted to think that depriving the IFB class 
from its block-wise operation and limiting it to i.i.d. distribu- 
tions would solve the "password" problem. However it is easy 
to devise a channel that would identify the input distribution 
of the reference encoder, while blocking the universal system. 
See Example 5 in Appendix F. These difficulties exemplify
the complexity of the universal communication problem. 

V. Universal communication over the modulo-additive channel

This section and the next focus on the modulo-additive
channel with an individual noise sequence. It is shown that the
IFB capacity of this channel is upper bounded by
$(1 - \rho(\mathbf{z})) \log|\mathcal{X}|$
and that this rate is universally achievable. Upper and lower
bounds on the convergence rates are given, which show that,
unfortunately, the transmission length $n$ required to obtain
universal communication grows exponentially with the block
length $k$ of the competing system.

The modulo-additive channel is a relatively "easy" case 
because of two main reasons: 

• It is memoryless in the input, and thus the "password" 
issue is avoided. 

• There is a single input prior, the uniform i.i.d. distribu- 
tion, which attains capacity for any noise sequence (since 
it maximizes the output entropy), therefore no adaptation 
of the prior is needed. 

A. A bound on the IFB capacity of the modulo-additive 
channel 

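In this section the following theorem is proven:

Theorem 1. The IFB capacity of the modulo-additive channel
$\mathbf{y} = \mathbf{x} + \mathbf{z}$, where $\mathbf{x}, \mathbf{y}, \mathbf{z} \in \mathcal{X}^\infty$ are infinite sequences denoting
the channel input, output and noise sequence, satisfies

$$C_{\mathrm{IFB}} \le (1 - \rho(\mathbf{z})) \cdot \log|\mathcal{X}| \qquad (2)$$

where $\rho(\mathbf{z})$ is the finite state compressibility of $\mathbf{z}$.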

For the sake of completeness let us briefly repeat the definition of finite state compressibility. A finite state encoder $F$ with $s$ states is defined by a next-state function $g : \{1, \ldots, s\} \times \mathcal{X} \to \{1, \ldots, s\}$ and an output function $f : \{1, \ldots, s\} \times \mathcal{X} \to \bigcup_{l \ge 0} \{0, 1\}^l$, where the output may be a bit sequence of any length, including the empty sequence. The encoder is said to be information lossless if for any $z^n$, the input $z^n$ can be uniquely decoded from the output sequence, given the initial and terminal states. Let $F(s)$ denote the set of all finite state information lossless encoders with at most $s$ states. Let the length of the output sequence for an input sequence of length $n$ be denoted $L(F(z^n))$; then the compression ratio of $z^n$ by $F$ is defined as:

$$\rho_F(z^n) = \frac{L(F(z^n))}{n \log|\mathcal{X}|} \tag{3}$$

The compression ratio of the best information lossless finite state encoder with at most $s$ states is denoted:

$$\rho_{F(s)}(z^n) = \min_{F \in F(s)} \rho_F(z^n) \tag{4}$$

And finally, the finite state compressibility of the infinite sequence $z = z_1^\infty$ is defined as:

$$\rho(z) = \lim_{s \to \infty} \limsup_{n \to \infty} \rho_{F(s)}(z^n) \tag{5}$$

Note that the order of limits is critical for this definition, since if the number of states is taken to infinity first, any sequence can be compressed to 1 bit (by having the state machine "remember" and identify the particular sequence). The outer limit exists, since $\rho_{F(s)}$ is decreasing in $s$ and bounded from below.
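As a rough numerical illustration of compressibility, the following sketch counts the phrases of the LZ78 incremental parsing and converts the count into a crude compression ratio. It is only a sketch under stated assumptions: the function names are hypothetical, logarithms are base 2, and charging each phrase one pointer plus one symbol is a simplification of the actual LZ78 code length, not the definition in (3)-(5).

```python
import math
import random

def lz78_phrase_count(seq):
    """Count the phrases of the LZ78 incremental parsing of seq."""
    seen = set()
    phrase = ()
    count = 0
    for symbol in seq:
        phrase += (symbol,)
        if phrase not in seen:      # shortest string not seen before ends a phrase
            seen.add(phrase)
            count += 1
            phrase = ()
    if phrase:                      # an incomplete last phrase still costs a codeword
        count += 1
    return count

def lz78_ratio(seq, alphabet_size):
    """Crude ratio: every phrase is charged a pointer plus one source symbol."""
    c = lz78_phrase_count(seq)
    bits_per_phrase = math.ceil(math.log2(c + 1)) + math.ceil(math.log2(alphabet_size))
    return c * bits_per_phrase / (len(seq) * math.log2(alphabet_size))

rng = random.Random(0)                      # fixed seed, illustration only
zeros = [0] * 4000
coin_flips = [rng.randrange(2) for _ in range(4000)]
print(lz78_ratio(zeros, 2), lz78_ratio(coin_flips, 2))
```

At such short lengths the crude ratio of the coin-flip sequence can exceed 1, which reflects the slow convergence of LZ78 rather than a defect of the definition.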

Proof outline of Theorem 1: Define $Z_{b,k}$ as the random vector of length $k$ formed by selecting one vector from the set of $b$ vectors $\{z_{(i-1)k+1}^{ik}\}_{i=1}^{b}$, with uniform probability of $\frac{1}{b}$ for each. In other words, the probability distribution of $Z_{b,k}$ equals the empirical distribution of the first $b$ blocks of length $k$ in $z$. Similarly define the random variables $X_{b,k}$ and $Y_{b,k}$ derived from the sequences $x$, $y$.

Suppose a given $E$, $D$ achieve rate $R$ and average error probability $\epsilon$ over $b$ blocks of size $k$. This is equivalent to saying they achieve error probability $\epsilon$ when operating on the stochastic channel $Y_{b,k} = X_{b,k} + Z_{b,k}$ (Figure 5). Therefore the standard converse of the channel capacity theorem implies that the rate $R$ can be bounded by $R \le \log|\mathcal{X}| - \frac{1}{k} H(Z_{b,k})$. Then, the limit of $\frac{1}{k} H(Z_{b,k})$ is related to the finite state compressibility $\rho(z)$. The latter relation is a variation of a result by Lempel and Ziv [6, Theorem 3] on the convergence of the sliding-window empirical entropy measured over increasing block lengths to the finite state compressibility (whereas here the block-wise empirical entropy is used instead). The full proof is given in Appendix A.
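The quantity at the heart of the outline, the block-wise empirical entropy, is easy to evaluate for a concrete noise sequence. The sketch below (hypothetical function names, base-2 logarithms, trailing symbols that do not fill a block are ignored) computes $H(Z_{b,k})$ and the resulting bound $\log|\mathcal{X}| - \frac{1}{k}H(Z_{b,k})$.

```python
import math
from collections import Counter

def blockwise_empirical_entropy(z, k):
    """Entropy in bits of the empirical distribution of the length-k blocks of z."""
    b = len(z) // k                                   # number of complete blocks
    blocks = Counter(tuple(z[i * k:(i + 1) * k]) for i in range(b))
    return -sum(c / b * math.log2(c / b) for c in blocks.values())

def ifb_rate_bound(z, k, alphabet_size):
    """log|X| - H(Z_{b,k}) / k, the bound appearing in the proof outline."""
    return math.log2(alphabet_size) - blockwise_empirical_entropy(z, k) / k

# A periodic noise sequence is fully predictable at block length 2 ...
print(ifb_rate_bound([0, 1] * 4, 2, 2))
# ... while a sequence visiting all four 2-blocks equally often leaves no rate.
print(ifb_rate_bound([0, 0, 0, 1, 1, 0, 1, 1], 2, 2))
```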

Note that the upper bound of Theorem 1 can sometimes be strict, i.e. there are sequences $z$ for which $C_{\mathrm{IFB}} < (1 - \rho(z)) \log|\mathcal{X}|$, as shown in the following example.

We do not have an expression for the IFB capacity.

Example 4. Consider, for the binary additive channel, the sequence $z$ which consists of blocks of ever increasing size. The first half of each block is 0, and the second half is chosen randomly, $Z_i \sim \mathrm{Ber}(\frac{1}{2})$. With high probability, the finite state compressibility of the sequence is $\frac{1}{2}$ (which can be attained, for example, by block-to-variable encoding, using one bit to denote the sequence of zeros). However, the IFB capacity of the channel is 0 with high probability, since for any encoder and decoder with large block size, approximately half of the blocks will be received in error. Therefore there exist sequences for which the inequality is strict.

B. Universally attaining the IFB capacity over the modulo- 
additive channel 

In this section the results regarding a universal system for the modulo-additive channel with an unknown state sequence are presented. The basis is a result from [12], [13], which shows that for a wide range of sequential source encoders, there is a communication scheme that asymptotically attains the rate $\log|\mathcal{X}| - \frac{1}{n} L(z)$, where $L(z)$ is the compression length of the sequence $z$ by the source encoder (the number of bits used to encode the sequence). For completeness, the result is stated formally and proven in the appendix (Section C, Theorem 4). Substituting the compression length of the Lempel-Ziv (LZ78) algorithm, the finite state compressibility is obtained (Theorem 2).

A similar theorem was presented in our previous paper [12]. As shall be seen, both assumptions are satisfied by Lempel-Ziv algorithms (LZ77 [14] and LZ78 [6]). Note the similarity between the rate expression (60) and the capacity of an ergodic stochastic modulo-additive channel (attained with a uniform prior), $C = I(X^\infty; Y^\infty) = H(Y^\infty) - H(Y^\infty | X^\infty) = \log|\mathcal{X}| - H(Z^\infty)$. Thus $\frac{1}{n} L(z)$ can be considered a generalized empirical measure of the noise entropy rate. In this sense, Theorem 4 is a generalization of Shayevitz and Feder's result [1]. This result can be considered as a special case of the "individual channels" framework [5], [15]. The scheme achieving the claims of Theorem 4 is presented in Section C2. By showing that LZ78 satisfies the conditions, the following can be proven:

Theorem 2. When the system of Theorem 4 is used in conjunction with the LZ78 source encoder over the modulo-additive channel, the following holds: for every noise sequence $z$ and every $\epsilon, \delta > 0$ there is $n$ large enough so that when the system is operated over $n$ channel uses, then with probability $1 - \epsilon$ the message is correctly decoded and the rate is at least $(1 - \rho(z)) \log|\mathcal{X}| - \delta$.

Corollary 2.1. The system defined above is IFB-universal.

Corollary 2.2. The system attains the Shannon capacity of every modulo-additive channel with a stationary ergodic noise sequence. (Footnote: note there is an error in our paper [12], where it was claimed the system only attains the mutual information.)

The proof is given in Appendix B.

Theorems 4 and 2 are finite horizon, i.e. the system is designed for a given transmission length $n$, and because $n$ needs to grow in order to make the overhead $\delta$ vanish, the asymptotic universality is obtained by a series of systems rather than a single one (as is standard in information theory). However, it is possible to design horizon-free systems in which the transmission length is not limited and redundancy vanishes with time [13]. The results of this section rely on the LZ compression algorithm and stress the relations between channel coding rates and compression ratios, and between IFB capacity and finite state compressibility. This relation is intuitively appealing and the resulting system is relatively simple. On the other hand, the modified universal system presented in the next section yields better bounds on the convergence of the overhead terms, which also hold uniformly in $z$.

Fig. 5. A probabilistic equivalence to iterative mapping



VI. The redundancy of the universal system 

Let us now consider the redundancy of the universal system and how fast it converges to zero as the block length increases, in the context of the modulo-additive channel. The interesting question is how large the transmission length $n$ needs to be in order to successfully compete with an IFB system of a given block size $k$. Unfortunately, $n$ must grow at least as fast as $|\mathcal{X}|^k$, approximately. Thus, even considering reference systems of relatively small block sizes compared to standard block codes (such as $k = 100$), the competition becomes infeasible.

A. A definition of redundancy

Before giving a definition of the redundancy, some considerations for the definition are provided. The finite state compressibility $\rho(z)$ used in Theorems 1 and 2 is irrelevant for the analysis of convergence. This is because $\rho(z)$ is an asymptotical value, and comparing the performance of a block source encoder or a finite state machine encoder on any finite block of $n$ symbols does not indicate anything about the final finite state compressibility. In other words, there is no guarantee on the rate of convergence of the $\limsup_{n \to \infty}$ in (5). Consider as an example a sequence $z$ which is incompressible up to time $n_1$ and then all zero to infinity, or vice versa (by the Kraft inequality, incompressible sequences must exist). Therefore, instead of considering the convergence of the rates obtained by the best IFB system and the universal system to $(1 - \rho) \log|\mathcal{X}|$, the comparison is between the rate obtained by the best IFB system of block size $k$ and that of a universal system, at time $n$.

While the asymptotic results (Theorem 1, Theorem 2) require the error probability of both systems to tend to zero with $n$, at a finite block length a certain non-zero error probability would exist. In the two systems, error probabilities have different meanings: the IFB system's error probability is block-wise, and the universal system's error probability is measured on the entire transmission. Therefore, for a fair comparison, and in order to remove the dependence on the error probability from the results, let us consider the following definition of an effective rate, for a system operating over blocks of size $k$ with rate $R$ and error probability $\epsilon$:

$$R^* = (1 - \epsilon) R - \frac{1}{k} h_b(\epsilon) \tag{6}$$

This definition is motivated by Fano's inequality (see (47)). While the first term is usually termed the good-put (the number of error-free bits), the second term compensates for the uncertainty in knowing whether there is an error or not. For example, a system delivering $R = 1$ bit per channel use with error probability $\epsilon = \frac{1}{2}$ per block of size $k = 1$ (i.e. one that transmits no information) would have $(1 - \epsilon) R = \frac{1}{2}$ but $R^* = 0$. Equivalently, $R^*$ may be interpreted as the minimum value of normalized mutual information between the input message and decoded message, given the parameters $R$, $\epsilon$ and $k$.

(Footnote to the closing remark of Section V-B: notice that because Theorem 2 essentially indicates convergence to the IFB capacity, the convergence cannot hold uniformly in $z$, as the capacity may be obtained by competing systems of unlimited complexity, depending on the noise sequence.)
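The effective rate is simple to evaluate numerically. The sketch below (hypothetical function names, base-2 logarithms) implements (6) literally and, for comparison, the variant obtained from the form of Fano's inequality that uses $\log(M - 1)$ for $M = 2^{kR}$ messages. On the single-bit example above, the literal formula evaluates to $-\frac{1}{2}$, while the $\log(M-1)$ variant evaluates to 0; both values are printed so that the two conventions can be compared.

```python
import math

def binary_entropy(p):
    """Binary entropy h_b(p) in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def effective_rate(rate, eps, k):
    """Effective rate (6): (1 - eps) * R - h_b(eps) / k."""
    return (1 - eps) * rate - binary_entropy(eps) / k

def fano_tight_rate(rate, eps, k):
    """Variant using Fano's inequality with log(M - 1), where M = 2^(k R)."""
    m = 2 ** (k * rate)
    penalty = binary_entropy(eps) + (eps * math.log2(m - 1) if m > 1 else 0.0)
    return rate - penalty / k

print(effective_rate(1.0, 0.5, 1), fano_tight_rate(1.0, 0.5, 1))
```

With zero error probability both functions return the nominal rate $R$, as expected.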

Another issue is how to compare a universal system with transmission length $n$ and an IFB system whose block length $k$ does not divide $n$. For a worst-case comparison, let us give the IFB system the luxury of using the last block, which possibly extends beyond the $n$-th symbol (i.e. $l = \lceil \frac{n}{k} \rceil$ blocks overall), while letting the noise sequence on these symbols, $z_{n+1}^{kl}$, take the values which are best for the IFB system.

A definition of the minimax redundancy is given below. Let $E$, $D$ define an IFB system with block length $k$ and rate $R_{\mathrm{IFB}}$ (Definition 2), which, when iteratively mapped to the modulo-additive channel with noise sequence $z$ of length $kl$ (where $l = \lceil \frac{n}{k} \rceil$), yields average error probability $\epsilon_{\mathrm{IFB}}$ (Definition 3). Similarly, on the same channel over $n$ symbols, an adaptive system $U$ with feedback and common randomness (Section IV-D), whose design must not depend on $z$, achieves rate at least $R_U$ with error probability at most $\epsilon_U$. As in Definition 6, $\epsilon_U$ includes both the probability of error and the probability that the system's rate falls below $R_U$. While $R_U$ is allowed to depend on the noise sequence $z$, $\epsilon_U$ is required to be fixed. Let $R^*_{\mathrm{IFB}} = R_{\mathrm{IFB}}(1 - \epsilon_{\mathrm{IFB}}) - \frac{1}{k} h_b(\epsilon_{\mathrm{IFB}})$ and $R^*_U = R_U (1 - \epsilon_U) - \frac{1}{n} h_b(\epsilon_U)$. The rate and error probability for each system are defined given the noise sequence and the system. The values related to the IFB system, $R_{\mathrm{IFB}}$, $\epsilon_{\mathrm{IFB}}$ and $R^*_{\mathrm{IFB}}$, depend implicitly on $(E, D, z)$, while the values related to the rate adaptive system, $R_U$ and $R^*_U$, depend implicitly on $U$, $z$.

The minimax redundancy for finite $n$, $k$ is defined as follows:

$$\Delta^*(n, k) = \min_U \max_z \left[ \max_{E, D} \{ R^*_{\mathrm{IFB}} \} - R^*_U \right] \tag{7}$$

I.e. it is the minimal gap $R^*_{\mathrm{IFB}} - R^*_U$ that can be universally guaranteed by a single system $U$ over all noise sequences. Note that the definition allows the universal system to depend on $k$, but this relaxation is not used by the universal system achieving the bounds below.

B. The minimax redundancy for the modulo-additive channel class

The minimax redundancy of a universal system compared to the IFB system over the modulo-additive channel is bounded as follows:

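Theorem 3. The minimax redundancy for the channel $y = x + z$ satisfies:

$$\Delta_- \le \Delta^* \le \Delta_+ \tag{8}$$

where

$$\Delta_- = \begin{cases} \left\lfloor \dfrac{\log(k\tau)}{\log|\mathcal{X}|} \right\rfloor \cdot \dfrac{\log|\mathcal{X}|}{2k} & \tau > \dfrac{|\mathcal{X}|}{k} \\[2ex] \dfrac{\tau}{2|\mathcal{X}|} \cdot \log|\mathcal{X}| & \tau \le \dfrac{|\mathcal{X}|}{k} \end{cases} \tag{9}$$

and for $\tau < 1$:

$$\Delta_+ = \frac{\tau}{2} \log\frac{1}{\tau} + \left( \frac{\tau^2}{4} + \tau \right) \log e + \delta^*_n + \frac{k}{n} \log(e|\mathcal{X}|). \tag{10}$$

The parameters are defined as follows:

$$\tau = \frac{|\mathcal{X}|^k}{n}, \qquad \delta^*_n = 4 \sqrt{\frac{\log|\mathcal{X}| \cdot \log(n^2 |\mathcal{X}|)}{n}} \tag{11}$$

Furthermore, the system attaining the upper bound $\Delta_+$ does not depend on $k$.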

The theorem is proven in the next section. Note that both bounds require $\tau$ to be small (and thus $n$ to be large) in order to achieve a small redundancy. While the lower bound is linear for $\tau \le \frac{|\mathcal{X}|}{k}$, for large values it increases significantly more slowly (like $\log \tau$). This is because of the inefficiency, at high rates, of the IFB system used in the lower bound. The values of $\Delta_-$ in the range $\tau \le \frac{|\mathcal{X}|}{k}$ are limited to $\frac{\log|\mathcal{X}|}{2k}$, i.e. a rate offset of half a symbol per block. The bound for the range $\tau > \frac{|\mathcal{X}|}{k}$ is useful in showing that even if one is satisfied with a redundancy of more than $\frac{\log|\mathcal{X}|}{2k}$, $\tau$ must be kept small. Fig. 6 illustrates the bounds of Theorem 3 as a function of the transmission length $n$, for a constant value of $k$. The logarithmic and quantized behavior of the lower bound for small values of $n$ can be observed. Fig. 7 presents the minimum $n$ required to obtain $\Delta^*(n, k) \le \delta \cdot \log|\mathcal{X}|$, according to the bounds of Theorem 3, as a function of $k$. Although the gap between the upper and lower bounds is significant (a little more than an order of magnitude), the trend is similar. These observations are formalized by Corollaries 3.1 and 3.2 below, which treat the asymptotical behavior of $n$:

Corollary 3.1. For a given k and S > 0, let n* — n*{k,S) 

be the minimum n such that A*(n, k) < 6log \X\. Then 



I-YI ' ' - - min[T(fc,<5,|A'|),l/fc] 

where g{T) = rlog (;^) and 



(12) 



r(fc, 6, \X\) = .9-1 (us - 12 • lA-r^/^) . log lA-l) (13) 





Fig. 6. The upper and lower bounds on the redundancy $\Delta^*(n, k)$ of universal systems given by Theorem 3, for $k = 20$, $|\mathcal{X}| = 2$ (curves: $\Delta$ lower bound, $\Delta$ upper bound).

Fig. 7. The minimum transmission length $n$ required to obtain a minimax redundancy $\Delta^*(n, k) < \delta \cdot \log|\mathcal{X}|$, according to the bounds of Theorem 3, as a function of the IFB block size $k$, for $|\mathcal{X}| = 2$, $\delta = 0.01$.



Corollary 3.2.

$$\lim_{\delta \to 0} \lim_{k \to \infty} \frac{\log n^*(k, \delta)}{k \log|\mathcal{X}|} = 1 \tag{14}$$

For large $k$ and fixed $\delta$, $T(k, \delta, |\mathcal{X}|)$ tends to a constant (a value of $g^{-1}$ that depends only on $\delta$ and $\log|\mathcal{X}|$) and does not dominate the upper bound (12). For a small value of $\delta$ both bounds of Corollary 3.1 behave approximately like $|\mathcal{X}|^k$. Corollary 3.1 results from a technical simplification of the bounds of Theorem 3 and is proven in Appendix D. Most important is the lower bound on $n^*$, which indicates the minimum rate at which $n^*$ must grow. Corollary 3.2 is an immediate consequence of Corollary 3.1 and formalizes the notion that $n^*$ grows approximately like $|\mathcal{X}|^k$.

Note that the system attaining the upper bound of Theorem 3 yields a stronger type of universality than claimed in Theorem 2, because for each value of $n$ the overheads are uniformly bounded for any noise sequence $z$, whereas previously, while the overheads are guaranteed to tend to zero asymptotically with $n$, this convergence is not necessarily uniform with respect to $z$.

Fig. 8. The entropy of the noise in the test channel $H(Z^n)$ over time, and the lower bound of (??), (??)

It is interesting to note that, while in the previous results presented the IFB system is used merely as a converse, in the proof of the lower bound $\Delta_-$ it is required to devise a specific IFB system. Here, the simplicity of the IFB system, which makes the other results intuitive and simple to derive, complicates the proof. The collapsed channel capacity, which upper bounds the IFB system rate, is usually not achievable by a finite block encoder, and a specific channel has to be devised in order for the IFB system to operate provably better than any universal system. It seems that richer classes of reference systems (e.g. systems using feedback as considered in [16]) may result in simpler and tighter lower bounds.

C. Proof of Theorem 3

1) Lower bound (reverse part): In order to show that the redundancy must be at least of the order stated by the lower bound $\Delta_-$, an example random channel is constructed in the following way. First, the encoder $E$ is defined. Then, a way to generate noise sequences $z$ is defined, such that the noise sequences belong to a subset $\mathcal{Z}_n$ of all possible sequences, and it is possible to decode the given code with zero error probability (for any noise sequence in the set). The IFB decoder $D$ is specified only after the noise sequence has been chosen. The sequence $z$ is drawn in a randomized way, thus creating a stochastic "test" channel. It is shown that there exists a noise sequence for which the rate of the universal system is bounded by the normalized mutual information over the test channel. Asymptotically, as there are certain constraints on the choice of the noise sequence, this normalized mutual information tends to the rate of the IFB encoder. However, at the beginning of the sequence, the entropy of the sequence is a little higher, and thus the mutual information is a little lower than the long-term average. Thus, the rate of the universal system is bounded by a value lower than the rate of the IFB system.

Let us first describe the IFB encoder and the test channel. The encoder sends $d$ symbols from the alphabet $\mathcal{X}$ over $k$ channel uses, and therefore has a rate

$$R_{\mathrm{IFB}} = \frac{d}{k} \ \text{symbols} = \frac{d}{k} \log|\mathcal{X}| \tag{15}$$

The encoding is simple: the first $k - d$ symbols (prefix) are constant and the remaining $d$ symbols (suffix) contain the message. The decoder is able to learn the value of the noise sequence over the prefix symbols, and knows a list of all possible noise sequences. Assuming that there is no more than one noise sequence with any given prefix, zero error probability is possible: the decoder finds the noise sequence from the prefix symbols, and cancels it on the suffix to find the message.
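The encoder and decoder just described can be sketched in a few lines. This is only an illustration under stated assumptions: the constant prefix is taken to be all zeros, symbols are integers modulo `q`, and the decoder's knowledge of the noise is represented by a hypothetical lookup table `suffix_of_prefix` from noise prefixes to noise suffixes.

```python
def encode_block(message, k, d):
    """Constant (all-zero) prefix of k - d symbols followed by the d message symbols."""
    assert len(message) == d
    return [0] * (k - d) + list(message)

def add_noise(x, z, q):
    """Modulo-additive channel: y = x + z (mod q), symbol by symbol."""
    return [(a + b) % q for a, b in zip(x, z)]

def decode_block(y, k, d, q, suffix_of_prefix):
    """Read the noise prefix off the constant part, look up its suffix, cancel it."""
    noise_prefix = tuple(y[:k - d])          # x is zero on the prefix, so y equals z there
    noise_suffix = suffix_of_prefix[noise_prefix]
    return [(a - b) % q for a, b in zip(y[k - d:], noise_suffix)]

# One block with k = 5, d = 2 over a ternary alphabet.
table = {(1, 2, 0): (2, 1)}                  # the only noise block in this toy example
y = add_noise(encode_block([1, 2], 5, 2), [1, 2, 0, 2, 1], 3)
print(decode_block(y, 5, 2, 3, table))
```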

The set of allowed noise sequences consists simply of those sequences for which each prefix $z_{(i-1)k+1}^{ik-d}$ ($i = 1, 2, \ldots$) is unique. The random noise sequence is generated as follows: at each block of $k$ symbols, the prefix of $k - d$ symbols is chosen randomly, uniformly over all $|\mathcal{X}|^{k-d}$ possible prefixes, and independently of the past noise sequence. Then, if the prefix has appeared before, the suffix equals the suffix of the noise sequence that already appeared. Otherwise, the suffix is chosen randomly, uniformly over all $|\mathcal{X}|^d$ possible suffixes.

Fig. 9. The reference encoder. Each block consists of a prefix of $k - d$ symbols followed by a suffix of $d$ symbols carrying the message ($m_1, m_2, m_3, m_4, \ldots$).

The choice of the first sequence $z_1^k$ is uniform over all possible sequences, and therefore the entropy of the noise sequence in the first block is maximal, $\log(|\mathcal{X}|^k)$. The choice of the noise sequences narrows with time, and after a long while, all possible prefixes would have been chosen, with one noise sequence per prefix. In this case, the choice of the suffix is determined by the prefix, and the entropy per $k$-block is $\log(|\mathcal{X}|^{k-d})$. This is the minimum entropy per block attained.

The behavior of the entropy $H(Z^n)$ in this channel is shown in Fig. 8.
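The generation of the test-channel noise can be sketched as follows. The function name and the use of Python's `random.Random` are illustrative assumptions; the logic follows the description above: a fresh uniform prefix for every block, and a suffix that is drawn once per prefix and reused whenever that prefix reappears.

```python
import random

def generate_test_noise(num_blocks, k, d, q, rng):
    """Draw num_blocks noise blocks of length k over the alphabet {0, ..., q-1}."""
    suffix_of_prefix = {}                    # table later available to the IFB decoder
    noise = []
    for _ in range(num_blocks):
        prefix = tuple(rng.randrange(q) for _ in range(k - d))
        if prefix not in suffix_of_prefix:   # new prefix: draw a fresh uniform suffix
            suffix_of_prefix[prefix] = tuple(rng.randrange(q) for _ in range(d))
        noise.extend(prefix + suffix_of_prefix[prefix])
    return noise, suffix_of_prefix

noise, table = generate_test_noise(200, 4, 2, 2, random.Random(1))
print(len(noise), len(table))
```

By construction each prefix determines its suffix, which is exactly the property the zero-error IFB decoder relies on.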

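Now, because $\epsilon_{\mathrm{IFB}} = 0$,

$$R^*_{\mathrm{IFB}} = R_{\mathrm{IFB}} (1 - \epsilon_{\mathrm{IFB}}) - \frac{1}{k} h_b(\epsilon_{\mathrm{IFB}}) = R_{\mathrm{IFB}} = \frac{d}{k} \log|\mathcal{X}| \tag{16}$$

Therefore:

$$\Delta^*(n, k) = \min_U \max_z \left[ \max_{E, D} \{ R^*_{\mathrm{IFB}}(E, D) \} - R^*_U \right] \ge \min_U \left[ R^*_{\mathrm{IFB}} - \min_{z^n \in \mathcal{Z}_n} R^*_U \right] \tag{17}$$

where $R^*_{\mathrm{IFB}}$ denotes the value defined in (16) for the specific reference system described.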

The universal system guarantees error probability $\epsilon_U$ for any $z$. By definition, for any $z^n \in \mathcal{Z}_n$, $R_U \ge R_0 \triangleq \min_{z^n \in \mathcal{Z}_n} [R_U]$. Therefore if $z^n$ is drawn randomly in $\mathcal{Z}_n$, then the universal system yields a rate of at least $R_0$ with error probability at most $\epsilon_U$ over the test channel, and can be converted to a fixed-rate system with feedback with rate $R_0$ over the same channel. Using Fano's inequality, which holds also in the case of feedback (see (47) in the proof of Theorem 1 and (39)),

$$R_0 (1 - \epsilon_U) - \frac{1}{n} h_b(\epsilon_U) \le \frac{1}{n} I(X^n; Y^n) \le \log|\mathcal{X}| - \frac{1}{n} H(Z^n), \tag{18}$$

and therefore for any universal system $U$:

$$\min_{z^n \in \mathcal{Z}_n} [R^*_U] = (1 - \epsilon_U) \min_{z^n \in \mathcal{Z}_n} [R_U] - \frac{1}{n} h_b(\epsilon_U) \le \log|\mathcal{X}| - \frac{1}{n} H(Z^n) \tag{19}$$



which yields the bound:

$$\Delta^* \ge R^*_{\mathrm{IFB}} - \left( \log|\mathcal{X}| - \frac{1}{n} H(Z^n) \right) = \frac{1}{n} H(Z^n) - \frac{k - d}{k} \log|\mathcal{X}| \tag{20}$$

Asymptotically, $\frac{1}{n} H(Z^n) \to \frac{k-d}{k} \log|\mathcal{X}|$, and thus the bound above tends to 0. The main point of the proof is to bound the convergence rate of $\frac{1}{n} H(Z^n)$.

It may appear surprising that, while it will be shown that the mutual information over the channel is slightly lower than $R^*_{\mathrm{IFB}}$, the IFB system transmits at rate $R^*_{\mathrm{IFB}}$ with zero error over this channel. The explanation for this is that the decoder is designed knowing the noise sequence, and therefore its effective rate is not limited by the mutual information.

(Footnote on the generation of the noise sequence above: an alternative way of generating the noise sequence, which yields the maximum entropy, is by uniform drawing over the set of all possible $k$-length sequences that satisfy the unique prefix condition. However, this complicates the bound.)

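The rest of this section is dedicated to the rather technical bounding of $H(Z^n)$. The entropy of each prefix, conditioned on the past, is $\log|\mathcal{X}|^{k-d} = (k - d) \log|\mathcal{X}| = k \cdot H_1$, where $H_1$ is the asymptotical entropy rate per symbol. The entropy of the suffix, given the past, changes over time. When choosing the $i$-th noise sequence (of length $k$), at most $i - 1$ different prefixes have already appeared. Therefore, the probability that the $i$-th prefix equals one of the previous ones is at most $\frac{i-1}{|\mathcal{X}|^{k-d}}$. Let us define $i^*_d = |\mathcal{X}|^{k-d}$ and consider first the case $i \le i^*_d$. In this case, the entropy of the suffix, given all previous symbols, is 0 with probability at most $\frac{i-1}{|\mathcal{X}|^{k-d}}$ and $\log|\mathcal{X}|^d$ with probability at least $1 - \frac{i-1}{|\mathcal{X}|^{k-d}}$, and is therefore at least $\left( 1 - \frac{i-1}{|\mathcal{X}|^{k-d}} \right) \cdot d \cdot \log|\mathcal{X}|$. Formally, define $P^{(i)} \triangleq z_{(i-1)k+1}^{ik-d}$ and $S^{(i)} \triangleq z_{ik-d+1}^{ik}$ as the $i$-th prefix and suffix, and $F_i = \mathrm{Ind}\left( \exists j < i : P^{(j)} = P^{(i)} \right)$ as a flag indicating whether $P^{(i)}$ appeared before. Then the entropy of the suffix given the past is:

$$\begin{aligned} H\left(S^{(i)} \mid Z^{(i-1)k}, P^{(i)}\right) &\overset{(a)}{=} H\left(S^{(i)} \mid Z^{(i-1)k}, P^{(i)}, F_i\right) \\ &= H\left(S^{(i)} \mid Z^{(i-1)k}, P^{(i)}, F_i = 0\right) \cdot \Pr(F_i = 0) \\ &\quad + H\left(S^{(i)} \mid Z^{(i-1)k}, P^{(i)}, F_i = 1\right) \cdot \Pr(F_i = 1) \\ &= d \cdot \log|\mathcal{X}| \cdot \Pr(F_i = 0) \\ &\ge d \cdot \log|\mathcal{X}| \cdot \left( 1 - \frac{i - 1}{|\mathcal{X}|^{k-d}} \right) \end{aligned} \tag{21}$$

where (a) is because $F_i$ is a function of $Z^{(i-1)k}, P^{(i)}$. Therefore

$$\begin{aligned} H\left(Z_{(i-1)k+1}^{ik} \mid Z^{(i-1)k}\right) &= H\left(P^{(i)} \mid Z^{(i-1)k}\right) + H\left(S^{(i)} \mid P^{(i)}, Z^{(i-1)k}\right) \\ &\ge d \cdot \log|\mathcal{X}| \cdot \left( 1 - \frac{i - 1}{|\mathcal{X}|^{k-d}} \right) + k \cdot H_1 \end{aligned} \tag{22}$$

For $i > i^*_d$ simply bound:

$$H\left(Z_{(i-1)k+1}^{ik} \mid Z^{(i-1)k}\right) \ge H\left(P^{(i)} \mid Z^{(i-1)k}\right) = k \cdot H_1 \tag{23}$$

Notice that $H_1$ determines the asymptotical entropy rate of the sequence, and matches the bound on the universal system and the rate of the IFB system.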



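For $i \le |\mathcal{X}|^{k-d}$:

$$\begin{aligned} H(Z^{ik}) &= \sum_{j=1}^{i} H\left(Z_{(j-1)k+1}^{jk} \mid Z^{(j-1)k}\right) \\ &\overset{(22)}{\ge} \sum_{j=1}^{i} \left[ d \cdot \log|\mathcal{X}| \cdot \left( 1 - \frac{j - 1}{|\mathcal{X}|^{k-d}} \right) + k \cdot H_1 \right] \\ &= d \cdot \log|\mathcal{X}| \cdot \left( i - \frac{i(i-1)}{2|\mathcal{X}|^{k-d}} \right) + i k \cdot H_1 \\ &= d \cdot \log|\mathcal{X}| \cdot i \left( 1 - \frac{i - 1}{2|\mathcal{X}|^{k-d}} \right) + i k \cdot H_1 \\ &\ge \frac{i}{2} \cdot d \cdot \log|\mathcal{X}| + i k \cdot H_1. \end{aligned} \tag{24}$$

The above implies that the entropy, at times $n = ik \le i^*_d k$, is bounded above the straight line with slope

$$H_0 \triangleq \frac{d}{2k} \log|\mathcal{X}| + H_1. \tag{25}$$

See Fig. 8. For $i > i^*_d$:

$$H(Z^{ik}) = H(Z^{i^*_d k}) + \sum_{j = i^*_d + 1}^{i} H\left(Z_{(j-1)k+1}^{jk} \mid Z^{(j-1)k}\right) \ge k |\mathcal{X}|^{k-d} H_0 + k (i - |\mathcal{X}|^{k-d}) H_1 \tag{26}$$

and in general

$$H(Z^{ik}) \ge ik \cdot H_1 + \min(i, |\mathcal{X}|^{k-d}) \, k \, (H_0 - H_1). \tag{27}$$

Consider now $H(Z^n)$ for $n$ that is not, in general, a multiple of $k$. Inside the block of length $k$, the per-symbol conditional entropy $H(Z_n \mid Z^{n-1})$ is $\log|\mathcal{X}|$ during the prefix, and then the entropy increases at a smaller or equal rate during the suffix. Therefore the entropy $H(Z^n)$ is concave during the block (Fig. 8). Because the entropy at block edges is bounded above the straight lines (27), the entropy inside the block is bounded by these lines as well, i.e. (27) can be extended to:

$$H(Z^n) \ge n \cdot H_1 + \min(n, k|\mathcal{X}|^{k-d}) (H_0 - H_1). \tag{28}$$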
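The bound at block edges can be checked numerically. The sketch below rests on one extra observation that is an assumption of this illustration rather than a step of the proof: since prefixes are drawn independently and uniformly, the probability that the $j$-th prefix is new equals $(1 - 1/M)^{j-1}$ with $M = |\mathcal{X}|^{k-d}$, which gives the block-edge entropy in closed form. The function names are hypothetical and `q` stands for $|\mathcal{X}|$.

```python
import math

def exact_entropy(i, k, d, q):
    """H(Z^{ik}) in bits, using Pr(j-th prefix is new) = (1 - 1/M)^(j-1)."""
    M = q ** (k - d)
    h1 = (k - d) / k * math.log2(q)
    return sum(k * h1 + d * math.log2(q) * (1 - 1 / M) ** (j - 1)
               for j in range(1, i + 1))

def entropy_lower_bound(i, k, d, q):
    """Right-hand side of (27): i k H_1 + min(i, M) k (H_0 - H_1)."""
    M = q ** (k - d)
    h1 = (k - d) / k * math.log2(q)
    h0 = d / (2 * k) * math.log2(q) + h1
    return i * k * h1 + min(i, M) * k * (h0 - h1)

for i in (1, 16, 60):
    print(i, exact_entropy(i, 6, 2, 2), entropy_lower_bound(i, 6, 2, 2))
```

For the first block the computed entropy equals $k \log|\mathcal{X}|$, the maximal value mentioned above.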



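Substituting in (20) yields:

$$\begin{aligned} \Delta^* &\ge \frac{1}{n} H(Z^n) - \frac{k - d}{k} \log|\mathcal{X}| \\ &\overset{(28)}{\ge} H_1 + \min\left( 1, \frac{k|\mathcal{X}|^{k-d}}{n} \right) (H_0 - H_1) - H_1 \\ &\overset{(25)}{=} \min\left( 1, \frac{k|\mathcal{X}|^{k-d}}{n} \right) \cdot \frac{d}{2k} \log|\mathcal{X}|. \end{aligned} \tag{29}$$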

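The bound is true for every $d \in \{1, \ldots, k\}$. Let us find a value of $d$ that approximately maximizes the bound for given $n$, $k$. Starting from $d = k$ and decreasing $d$, each decrease of 1 multiplies the first factor in the RHS of (29) by $|\mathcal{X}|$ (doubling it in the binary case), as long as $\frac{k|\mathcal{X}|^{k-d}}{n} < 1$, and only linearly decreases the second factor. Therefore it is beneficial to decrease $d$ as long as $\frac{k|\mathcal{X}|^{k-d}}{n} < 1$, and no more than one additional step. For simplicity let us always take the additional step and determine $d$ as the maximum $d \in \{1, \ldots, k\}$ so that $\frac{k|\mathcal{X}|^{k-d}}{n} \ge 1$, or $d = 1$ if no such $d$ exists, i.e.

$$d = \max\left( \left\lfloor \frac{\log\left( k|\mathcal{X}|^k / n \right)}{\log|\mathcal{X}|} \right\rfloor, 1 \right) \tag{30}$$

If $n > k|\mathcal{X}|^{k-1}$, then $d = 1$, and $\min\left( 1, \frac{k|\mathcal{X}|^{k-d}}{n} \right) = \frac{k|\mathcal{X}|^{k-1}}{n}$. In this case (29) yields:

$$\Delta^* \ge \frac{k|\mathcal{X}|^{k-1}}{n} \cdot \frac{1}{2k} \log|\mathcal{X}| = \frac{|\mathcal{X}|^{k-1}}{2n} \cdot \log|\mathcal{X}| \tag{31}$$

Otherwise, $\min\left( 1, \frac{k|\mathcal{X}|^{k-d}}{n} \right) = 1$, and (29) yields:

$$\Delta^* \ge \frac{d}{2k} \log|\mathcal{X}| = \left\lfloor \frac{\log\left( k|\mathcal{X}|^k / n \right)}{\log|\mathcal{X}|} \right\rfloor \cdot \frac{\log|\mathcal{X}|}{2k} \tag{32}$$

Equations (31), (32) are represented in a compact form in (9) above. This proves the lower bound of Theorem 3. □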
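The choice of $d$ and the resulting lower bound can be evaluated directly. The sketch below is a reading of the rule stated in the text (the largest $d$ with $k|\mathcal{X}|^{k-d} \ge n$, or 1 if there is none), implemented with integer arithmetic to avoid rounding at the floor; the function names are hypothetical and `q` stands for $|\mathcal{X}|$.

```python
import math

def choose_d(n, k, q):
    """Largest d in {1, ..., k} with k * q^(k-d) >= n, or 1 if no such d exists."""
    for d in range(k, 0, -1):
        if k * q ** (k - d) >= n:
            return d
    return 1

def redundancy_lower_bound(n, k, q):
    """min(1, k q^(k-d) / n) * d / (2k) * log2(q), evaluated at the chosen d."""
    d = choose_d(n, k, q)
    return min(1.0, k * q ** (k - d) / n) * d / (2 * k) * math.log2(q)

# k = 20 and a binary alphabet, as in Fig. 6.
for n in (2 ** 10, 2 ** 20, 2 ** 25):
    print(n, choose_d(n, 20, 2), redundancy_lower_bound(n, 20, 2))
```

For small $n$ the bound moves in steps of $\frac{\log|\mathcal{X}|}{2k}$ as $d$ changes, and for large $n$ it decays like $1/n$.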
2) Upper bound (direct part): The purpose is to show the existence of a universal system that attains a small redundancy with respect to the reference system, which is a result similar to that of Theorem 2, however with a more careful analysis of the overheads. Following the same logic as the proof of Theorems 1 and 2, the difference between the good-put of the two systems is bounded by the following relations:

• The relation between $R^*_U$ and the ideal $R_{\mathrm{emp}}$ target of the rate adaptive system (i.e. the overhead term of Theorem 4)

• The relation between $R_{\mathrm{emp}}$ and the collapsed channel capacity (equivalently the collapsed noise entropy $H(Z_{l,k})$)

• The relation between $R^*_{\mathrm{IFB}}$ and $H(Z_{l,k})$ obtained using Fano's inequality (as in the proof of Theorem 1)

Considering the scheme that was described for the achievability result of Theorems 4 and 2, the largest overhead is due to the second element. This large overhead is in some sense unavoidable, as the converse shows; however, it is especially large due to the use of the LZ78 algorithm, which has a slow $O(1/\log n)$ convergence rate. Specifically, using [6, Thm. 1, 2],



this term i.e. the bound on -Ljs{z) 



log n 



y^b.k) behaves like 



i.e. in order for this term to be small, it is 



not only required that n ^ \X\ , but that this relation holds 

in the logarithm. For example, to have '"^^^j^ ^ = ™® 
needs n = \X^^'^ . 

To obtain a tighter bound, a more general result from 
|[T3l can be applied. fTT Thm. 8] shows that for every 
causal probability distribution P(x|y), i.e. satisfying for all 
i < n: P(x%") = P(x*|y*), the rate function i?omp = 
n ^Q^(x") adaptively achievable with overhead of (5„ = 

3y , where is the minimum non- 
zero value of (5(a;i|xi_i)|^ 

Substitute as $Q$ the uniform distribution $Q(x^n) = |\mathcal{X}|^{-n}$, having $q_{\min}^{-1} = |\mathcal{X}|$. Take $P(x|y) = P_Z(x - y)$, for some probability distribution $P_Z(z)$. This choice satisfies the causality condition and yields

$$R_{\mathrm{emp}} = \log|\mathcal{X}| + \frac{1}{n} \log P_Z(x^n - y^n) \tag{33}$$

log I AT I • log 

with 5n ~ — While the convergence of 

$\delta_n \to 0$ requires $\epsilon_U$ to decay subexponentially with $n$, the choice of $\epsilon_U$ will lead to a reduction of $\epsilon_U R_{\mathrm{emp}} \le \epsilon_U \log|\mathcal{X}|$ in rate. For simplicity let us choose $\epsilon_U = \frac{1}{n}$, as this factor is insignificant. In other words, there exists a system with $\epsilon_U = \frac{1}{n}$ which with probability $1 - \epsilon_U$ transmits at rate $R_{\mathrm{emp}} - \delta_n$ without error over the channel. Therefore

R*, = Pu(l - £„) - -hb{e„) 
n 

> (i?omp - Sn){l - £„) /ib(eu) 



„p<log|A'|.(33) 

> 



eu log I A- 1 



1 



hb{ev) 



> Rc 



> R. 



log\X\ 


■ log 


{n^\X\) 


n 


log\X\ 


• log 


in^\X\) 




n 





-log\X\-- 
n n 



=K 

(34) 

where in the last step, for simplification of the bound, it was assumed that $n > \log|\mathcal{X}|$ (otherwise, $\delta_n$ is large).

If one is interested in competing with an IFB system with block length $k$, it would make sense to treat each $k$ symbols of the noise sequence as a single super-symbol, and take as $P_Z$ the universal distribution defined by Krichevsky and Trofimov [17] over these super-symbols. This distribution is universal in the sense that, up to a small overhead, $-\frac{1}{l} \log P_Z(z) \approx \hat{H}(z)$ (with $l$ the number of super-symbols), i.e. the probability matches the empirical entropy of the sequence, which in the current case is $H(Z_{l,k})$. Furthermore, this holds with a redundancy close to the minimum possible. It is possible to construct a universal distribution $P_Z$ that compares well with all distributions over the $n$ symbols which are i.i.d. over $k$-length blocks, by a weighted average of Krichevsky-Trofimov distributions.
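To make the role of the Krichevsky-Trofimov distribution concrete, the sketch below computes the sequential KT code length over super-symbols and compares it with the empirical entropy. It assumes the standard add-one-half form of the estimator, $\Pr(a \mid \text{past}) = \frac{n_a + 1/2}{t + m/2}$ for an alphabet of size $m$; the function names are hypothetical, and the weighted average over block lengths mentioned above is not implemented.

```python
import math
from collections import Counter

def to_super_symbols(z, k):
    """Group z into consecutive length-k blocks (an incomplete tail is dropped)."""
    return [tuple(z[i:i + k]) for i in range(0, len(z) - k + 1, k)]

def kt_code_length(symbols, alphabet_size):
    """-log2 of the sequential Krichevsky-Trofimov probability of symbols."""
    counts = Counter()
    bits = 0.0
    for t, s in enumerate(symbols):
        p = (counts[s] + 0.5) / (t + alphabet_size / 2)   # add-1/2 estimator
        bits -= math.log2(p)
        counts[s] += 1
    return bits

def empirical_entropy_bits(symbols):
    """Total empirical entropy, i.e. len(symbols) times the per-symbol entropy."""
    n = len(symbols)
    return -sum(c * math.log2(c / n) for c in Counter(symbols).values())

blocks = to_super_symbols([0, 1, 1, 0, 0, 1, 0, 0] * 8, 2)    # 32 super-symbols
print(kt_code_length(blocks, 4), empirical_entropy_bits(blocks))
```

The KT code length always exceeds the empirical entropy, and the gap is the redundancy that the overhead term accounts for.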

Let $\pi_k(z^k)$ denote a distribution over the $k$-letter $z^k$, where $k$ is not assumed to divide $n$. This defines also a distribution on the partial sequence of length $i < k$ by taking the marginal $\pi_k(z^i) = \sum_{z_{i+1}^k} \pi_k(z^k)$. The distribution over $n$-length vectors associated with $\pi_k$ is defined as the i.i.d. extension of $\pi_k$, where the marginal distribution is used for the remainder that does not divide by $k$. This $n$-length distribution will be denoted by the same symbol:

$$\pi_k(z) \triangleq \prod_{i=1}^{\lfloor n/k \rfloor} \pi_k\left( z_{(i-1)k+1}^{ik} \right) \cdot \pi_k\left( z_{\lfloor n/k \rfloor k + 1}^{n} \right) \tag{35}$$

Then, by weighting Krichevsky-Trofimov distributions it is possible to obtain the following result:

Lemma 1. There exists a distribution $P_Z(z)$, $z \in \mathcal{X}^n$, such that for all $k$ for which $\tau = \frac{|\mathcal{X}|^k}{n} < 1$:

$$\forall \pi_k : \ \frac{1}{n} \log \pi_k(z^n) \le \frac{1}{n} \log P_Z(z) + \Delta_\pi(k, n), \tag{36}$$

where

Substituting c(fb = 1, -D = in tlie parameters of tlie tlieorem. 



$$\Delta_\pi(k, n) = \frac{\tau}{2} \log\frac{1}{\tau} + \left( \frac{\tau^2}{4} + \tau + \frac{k}{n} \right) \log e \tag{37}$$

The detailed derivation and proof appear in Appendix E.
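The next stage is to relate $\pi_k(z^n)$ to $H(Z_{l,k})$. Let $z_i^{[k]} = z_{(i-1)k+1}^{ik}$ denote the $i$-th $k$-block of $z$. Recall that $l = \lceil \frac{n}{k} \rceil$ is the number of $k$-blocks that cover the $n$ symbols, and $Z_{l,k}$ is a random variable generated by uniform selection out of $z_1^{[k]}, \ldots, z_l^{[k]}$. Let $\hat{P}_{Z_{l,k}}$ be the distribution of $Z_{l,k}$, which is the empirical distribution of the blocks $z_i^{[k]}$. Then

$$\begin{aligned} H(Z_{l,k}) &= -\sum_a \hat{P}_{Z_{l,k}}(a) \log \hat{P}_{Z_{l,k}}(a) \\ &\overset{(a)}{=} -\frac{1}{l} \max_\pi \log \pi_k(z^{kl}) \\ &\overset{(b)}{\ge} -\frac{1}{l} \max_\pi \log \pi_k(z^n) \\ &\overset{(36)}{\ge} -\frac{1}{l} \left( \log P_Z(z) + n \Delta_\pi(k, n) \right) \end{aligned} \tag{38}$$

where (a) is because the empirical distribution maximizes the joint distribution of the vector; the expression following (a), where the maximization is over all $k$-letter distributions $\pi$, could be considered an alternative definition of $H(Z_{l,k})$. Transition (b) holds because extending the vector reduces its probability (see also the definition of $\pi_k(z^n)$ in (35)). Finally, by Fano's inequality (see (47) in the proof of Theorem 1),

$$R^*_{\mathrm{IFB}} = R_{\mathrm{IFB}} (1 - \epsilon_{\mathrm{IFB}}) - \frac{1}{k} h_b(\epsilon_{\mathrm{IFB}}) \le \log|\mathcal{X}| - \frac{1}{k} H(Z_{l,k}) \tag{39}$$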

(39) 

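Combining the above yields

$$\begin{aligned} R^*_{\mathrm{IFB}} &\overset{(39)}{\le} \log|\mathcal{X}| - \frac{1}{k} H(Z_{l,k}) \\ &\overset{(38)}{\le} \log|\mathcal{X}| + \frac{1}{kl} \left( \log P_Z(z) + n \Delta_\pi(k, n) \right) \\ &\le \frac{n}{kl} \left( \log|\mathcal{X}| + \frac{1}{n} \log P_Z(z) \right) + \frac{kl - n}{kl} \log|\mathcal{X}| + \Delta_\pi(k, n) \\ &\overset{(33)}{\le} R_{\mathrm{emp}} + \frac{k}{n} \log|\mathcal{X}| + \Delta_\pi(k, n) \\ &\overset{(34)}{\le} R^*_U + \delta^*_n + \frac{k}{n} \log|\mathcal{X}| + \Delta_\pi(k, n) \end{aligned} \tag{40}$$

Since this holds for any noise sequence and any pair $E$, $D$,

$$\Delta^*(n, k) \le \max_{z, E, D} \left( R^*_{\mathrm{IFB}} - R^*_U \right) \le \Delta_\pi(k, n) + \delta^*_n + \frac{k}{n} \log|\mathcal{X}| \tag{41}$$

This proves the upper bound. □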

VII. Discussion AND EXTENSIONS 

The model presented in this paper supplies the first defini- 
tion of a "universal communication system", and the results 
indicate that such universal communication with feedback is 
possible in the non trivial example of the modulo additive 
channel with an individual state sequence. 



A. Alternative definitions of universality 

The IFB comparison class was chosen as the perhaps 
simplest and most intuitive comparison class for universal 
communication. However, it has several drawbacks: 

1) The reference system is limited in terms of complexity, 
feedback, etc. 

2) On the other hand, universality is achieved only at extremely large values of the transmission length n (in this respect the scheme is no different from Lempel-Ziv compression).

3) The definition motivates learning k-periodic structures in the channel, which is counterintuitive. This may be solved, e.g., by starting the reference system at an arbitrary time (rather than n = 1), or by using structures that are not periodic (e.g. finite state machines [16]).

4) While the IFB capacity is limited by the "collapsed 
channel capacity", it usually falls short of it. Further- 
more, had the channel been a stochastic memoryless 
one, a rather large block size would be needed for the 
IFB system in order to yield a small error probability. 
A possible solution is to define the collapsed channel 
capacity itself as a target rate, but it is not clear how 
this should be defined for channels with memory. 

1) Possible enhancements of the IFB class: In this section several possible extensions in terms of complexity are considered. While there are reasons for such extensions, in our opinion, due to the convergence rate issues (Section VI), the right direction would be to simplify rather than enrich the comparison class.

Since the reference system enjoys the advantage of being 
designed for the specific noise sequence, this advantage is 
compensated by imposing some restrictions on the reference 
system, which are not imposed on the universal system. This 
is similar to what is done in universal source coding and 
universal prediction, when the comparison class is too rich. 
The definition of $C_{IFB}$ limits the reference system in several factors in which the universal system is not restricted, namely its complexity and its use of feedback, common randomness and rate adaptivity. Relaxing any of these factors may yield a higher target rate (an alternative to $C_{IFB}$) which is still universally attainable. Some potential variations are given below:

1) Randomness: allowing the reference system the use of 
common randomness. 

2) Rate adaptivity: allowing rate adaptivity in various lev- 
els. Error detection can be considered a very basic level 
of adaptivity. 

3) Complexity: definition of the encoder/decoder as finite 
state machines rather than block encoders/decoders. 

4) Feedback: allowing the use of (a possibly limited amount 
of) feedback for the reference system. 

The first two extensions, (1) and (2), are trivial, and are not pursued here in order to simplify the presentation. Misra and Weissman [16] presented a class of finite state machine encoders and decoders with feedback (termed the FS class) that includes all the enhancements above, and showed that for the modulo-additive channel, the maximum rate achieved by the reference class is at most $R = (1 - \rho(z)) \log|\mathcal{X}|$, so the current result on universality also holds with respect to this enhanced class. Furthermore, they show that, unlike the IFB class (Section IV-E), the FS class achieves the rate $R$ when the complexity is allowed to grow. Notwithstanding these results, the IFB class is still of interest due to its simplicity, which allows simple analysis and consideration of more complex channel models [10].

Below, these extensions are briefly discussed. Although they have already been made by Misra and Weissman in the context of the modulo-additive channel, it is interesting to consider them also in the context of more general channel models.

Common randomness: Allowing the reference system the use of common randomness does not change the results, as long as the common randomness is independent of the noise sequence and/or the block number. This is because the IFB capacity would still be upper bounded by the collapsed channel capacity. This also holds for channels with fading memory [10], where the collapsed channel capacity is used as a bound for the IFB rate.

Rate adaptivity: The IFB system may be allowed to choose the transmission rate adaptively at the decoder. A simple form of rate adaptivity is error detection, i.e. the decoder is allowed to choose between rate $R$ and rate 0. In the latter case, decoding errors are ignored. On the other hand, the IFB rate is defined in an effective way, counting only the blocks that were actually decoded. Under suitable definitions, the effective rate of the IFB system would still be bounded by the collapsed channel capacity, so the results easily extend. Note that allowing error detection effectively models a block coding system using automatic repeat request (ARQ). Note also that in any case of rate adaptation, for a fair comparison, the decision on the rate must be made at the decoder based on the received sequence alone.

Complexity: In order to achieve competitively universal communication, it is essential that both the reference encoder and the decoder be limited in some way (assuming they are designed knowing the channel). Consider, for example, the modulo additive channel. If the encoder is not limited, then it can transmit data at the maximum rate of $\log|\mathcal{X}|$ bits per channel use, by uncoded transmission and subtraction of the noise sequence at the encoder. In this case the decoder essentially does nothing, so restrictions on the decoder alone do not help. Conversely, if the decoder is not limited, the encoder can transmit uncoded and the noise sequence can be canceled at the decoder, so the same happens with the roles exchanged. As mentioned, an extension to finite state machines with feedback (FS-class) has already been made [16]. An interesting issue for further study is the universality with respect to the FS-class in general channel models.
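The argument for an unrestricted encoder can be illustrated with a minimal sketch. It assumes a modulo-additive channel over the integers mod q; the function names `channel` and `informed_encoder` are hypothetical and introduced here only for illustration. An encoder that knows the noise sequence subtracts it in advance, so the channel output equals the uncoded message.

```python
import random


def channel(x, z, q):
    """Modulo-additive channel: y_i = (x_i + z_i) mod q."""
    return [(xi + zi) % q for xi, zi in zip(x, z)]


def informed_encoder(message, z, q):
    """Hypothetical unrestricted encoder that knows the noise sequence z
    and subtracts it symbol by symbol before transmission."""
    return [(mi - zi) % q for mi, zi in zip(message, z)]


if __name__ == "__main__":
    q, n = 4, 16
    rng = random.Random(0)
    z = [rng.randrange(q) for _ in range(n)]        # arbitrary noise sequence
    message = [rng.randrange(q) for _ in range(n)]  # log2(q) bits per symbol
    y = channel(informed_encoder(message, z, q), z, q)
    print(y == message)  # the decoder reads the message directly from y
```

Exchanging the roles (an uncoded encoder and a decoder that subtracts z from y) gives the same result, which is why the comparison class must restrict both sides.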

Feedback: Several types of feedback may be considered:

1) Feedback inside the block (where the state is reset from block to block). Because the collapsed channel is a channel with memory, feedback can increase its capacity. Since the increase in capacity is obtained by changing the input distribution (prior) in response to feedback (information on the channel state), in order to compete in this case, the universal system would also need to adapt its input distribution per symbol. Hence, the universal systems presented here and in [10] are not suitable for this setting. However, for the modulo-additive channel, feedback does not increase capacity, because the best input distribution is uniform regardless of any knowledge of the channel state (in other words, as is easy to see, the bound based on Fano's inequality (47) would hold regardless of feedback), and in this particular case, the results do extend to the case of feedback inside the block (see also [16]).

2) Feedback between blocks (the encoder of block $b$ receives a message from the decoder of block $b - 1$). This kind of feedback effectively increases the block size of the IFB system, as it allows it to keep track of the block index to some extent by passing it back and forth between the encoder (through the channel) and the decoder (through the feedback link). Of course, this cannot be continued when the number of bits required to represent the block index is larger than $k \log|\mathcal{X}|$. In the modulo-additive channel, knowledge of the block index yields the maximum capacity of $\log|\mathcal{X}|$. It is interesting to note that, while such feedback seems to considerably strengthen the IFB system, Misra and Weissman [16] showed that the rate of the FS-class is limited in spite of feedback. This is because the restriction is on the number of states rather than on the block length.

3) It is possible to allow the reference system the use of 
asymptotically zero-rate feedback (which does not con- 
siderably increase the effective block length and cannot 
considerably increase the collapsed channel capacity). 

2) An alternative comparison class: As mentioned, a relatively short block size prevents the IFB class from attaining the collapsed channel capacity. This gap is not utilized in the current bounds. The collapsed channel capacity bound would still hold if the encoder and decoder were allowed to operate over multiple blocks, but treat each block in the same way.

One option to define this class is to limit the encoder to a random encoder over the entire transmission length $n$, with an i.i.d. prior of choice (alternatively, i.i.d. in blocks) and limit the decoder to use a memoryless decoding metric (or more generally, alpha decoding, i.e. type-based decoding). Another similar way is to let the encoder and decoder be general but randomly permute the inputs and outputs of the channel. As before, the reference encoder and decoder are limited, but are designed based on full channel knowledge. For the modulo-additive channel, it is easy to see that in both cases the reference rate would be limited to $\log|\mathcal{X}| - \hat{H}(z)$. It is more interesting to discuss these classes in the case of general channels; see [10]. Note that although these reference systems would fail for the password channel defined in Example 3, it is possible to devise an alternative example, showing that universal communication with respect to these classes over general channels is not possible [10].

B. Other comments 

Theorem 4, connecting the transmission rate to the compression rate of the noise sequence, is reminiscent of Ahlswede's channel coding scheme with feedback [11]. This scheme sends information by iteratively compressing the receiver's uncertainty with regard to the transmitted message. Indeed, Ooi
ifTSl has used this scheme in order to achieve adaptive commu- 
nication over compound channels, including compound finite 
state channels. Ooi assumes a compound channel (probabilistic 
with unknown parameters) and varies the rate by changing the 
transmission length, while here an individual noise sequence 
is considered and the rate is varied by changing the number 
of bits transmitted (using a variable block length is a simpler, 
particular case, that can be obtained by transmitting a single 
block, in the scheme presented here). Adapting Ooi's scheme 
to the individual noise sequence channel seems complicated 
while using random coding yields a simple proof for the 
current result. 

The result is also closely related to Ziv's result [19] regarding universal decoding over compound finite state channels. If Theorem 4 is particularized to the non-adaptive case, then it
can be proven and generalized by the tools used there. The de- 
coder in Ziv's paper uses joint Lempel-Ziv parsing and yields 
a decoding metric which generalizes in a sense the metric used 
here, for channels which are not necessarily memoryless (see 
also Section ??). Theorem 2 and particularly Lemma 1 there, 
relate the size of the error sets Mo,Af„ (defined there) for 
the maximum likelihood decoder designed for the finite state 
channel, and the universal decoder. This relation indicates the 
rate that can be achieved with a given error probability is 
asymptotically the same. Furthermore, the only assumption 
used about the reference (maximum likelihood) decoder is that 
it uses a finite state metric (see the proof of Lemma 1 there), 
and thus the IFB decoder falls into this class. 

In a previous paper [5], a different framework, termed "individual channels", was considered, in which no relation between the input and output of the channel is assumed a priori, and the communication rate is given as a function of the input and output sequences. As an example, the empirical mutual information $\hat{I}(x; y)$ is shown to be achievable. The current results fall under this category, i.e. under the definitions of "individual channels" [5], [20], the rate function $R_{emp}(x, y) = \log|\mathcal{X}| - \frac{1}{n} L(y - x)$ is asymptotically adaptively achievable (i.e. by an adaptive rate system). Note that one does not need to assume that the channel is truly modulo-additive to show this. It is also possible to show [20] that all achievable rate functions that depend only on the noise sequence, $R_{emp}(x, y) = R(y - x)$, are of this form (asymptotically), i.e. given a system attaining the rate $R_{emp}(x, y) = R(y - x)$ for each individual sequence with a uniformly distributed channel input, there exists a source encoding scheme with encoding lengths asymptotically approaching $L(z) = n \log|\mathcal{X}| - n R(z)$.

In previous works [1], [2], rates which reflect the average channel behavior, such as $1 - \hat{H}(z)$, were termed "empirical capacity", mainly based on the similarity to the capacity expressions for memoryless channels. The term is not completely justified, since clearly this is not the maximum communication rate. The value $C_{IFB}$ seems to be a better candidate to describe the modulo-additive channel's "empirical capacity", although as discussed above, other interesting definitions can be suggested. Note that there is no fixed order between $1 - h_b(\hat{\epsilon})$
and CiFB- For example for z = 0, 1, 0, 1, 0, the relation is 
= 1 — /ih(e) < CiFB = 1> while in Example |4] the order is 
inverse = Cifb < 1 — hi,{e) — 1 — hi, (^). On the other 
hand the relation 1 ~ hi,{i) < 1 — p{z) always hold^ so the 
rates achieved by the scheme described here are asymptotically 
better than the previously achieved rates [IJ. 

Note that the current results assume the noise sequence 
is fixed and unknown, and cannot be extended to the case 
where the noise sequence is determined by an adversary 
(i.e. Zi is a function of x]^^), and the reference class is 
aware of the adversary strategy. To see this, it is very easy 
to design an adversary that identifies the codebook used by 
the reference encoder, and locks the channel (by choosing 
the noise sequence randomly) once a different channel input 
appears. 

Since the current results are focused on the modulo-additive channel, an interesting question is: can the IFB capacity be universally attained over other classes of channels? The results shown here rely on the fact that the modulo additive channel is memoryless in the input, and that its capacity is obtained by a fixed prior. The former restriction is the more crucial one, and the latter can be alleviated [21]. In [10] the current results are extended to the case of channels with memory, under a restriction that this memory fades with time.

VIII. Conclusion 

This paper considered target rates for universal systems with feedback and focused on the modulo additive channel. The notion of the iterated finite block capacity, denoted $C_{IFB}$, was defined for a vector channel as the highest rate achievable by encoders and decoders that may be designed for the particular relation that exists between the input and output, yet are constrained to be of finite block length and use the same scheme over each block. The IFB capacity $C_{IFB}$ was used as a target communication rate to be achieved without any prior knowledge of the channel, using feedback. It was shown that $C_{IFB}$ cannot be achieved universally for completely general input-output relations; however, for the modulo-additive channel with an individual noise sequence, it can be achieved universally without knowing the noise sequence. Specifically, it was shown that $C_{IFB} \leq (1 - \rho) \log|\mathcal{X}|$, where $\rho$ is the finite state compressibility of the noise sequence, and a universal system with feedback attaining a rate of at least $(1 - \rho) \log|\mathcal{X}|$ was presented. This result is relatively simple due to the properties of the modulo additive channel. In a follow-up paper [10] the result is extended to more general channels.

Appendix 

A. Proof of Theorem 1

Suppose a given $E, D$ achieve rate $R$ and average error probability $\epsilon$ over $b$ blocks of size $k$. Let us adopt the definitions of $X_{b,k}$, $Z_{b,k}$ and $Y_{b,k}$ from Section V-A and likewise define $m$ and $\hat{m}$ to be random variables generated by selecting the block index uniformly over $1, \ldots, b$ and taking the respective encoded and decoded messages, i.e. $m = m_U$, $\hat{m} = \hat{m}_U$, where $U \sim \mathcal{U}\{1, \ldots, b\}$. Then

$$
\Pr(\hat{m} \neq m) = \sum_{i=1}^{b} \Pr(\hat{m}_i \neq m_i) \Pr(U = i) = \frac{1}{b} \sum_{i=1}^{b} \Pr(\hat{m}_i \neq m_i) \leq \epsilon
\tag{42}
$$

^This can be shown by block-to-variable encoding at rate $h_b(\hat{\epsilon}_i)$, where $\hat{\epsilon}_i$ is the empirical probability of 1s in the block, and the convexity of $h_b(\cdot)$.

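The rate $R$ is now bounded using the entropy of $Z_{b,k}$. By Fano's inequality

$$
H(m | \hat{m}) \leq h_b(\epsilon) + \epsilon \log M
\tag{43}
$$

Therefore, by the information processing inequality,

$$
I(X_{b,k}; Y_{b,k}) \geq I(m; \hat{m}) = H(m) - H(m | \hat{m}) \geq \log M - \left( h_b(\epsilon) + \epsilon \log M \right)
\tag{44}
$$

On the other hand

$$
I(X_{b,k}; Y_{b,k}) = H(Y_{b,k}) - H(Y_{b,k} | X_{b,k}) = H(Y_{b,k}) - H(Z_{b,k})
\tag{45}
$$

Combining the two, and using $H(Y_{b,k}) \leq k \log|\mathcal{X}|$:

$$
(1 - \epsilon) \log M - h_b(\epsilon) \leq I(X_{b,k}; Y_{b,k}) \leq k \log|\mathcal{X}| - H(Z_{b,k})
\tag{46}
$$

Therefore

$$
R \leq \frac{1}{k} \log M \leq (1 - \epsilon)^{-1} \left[ \log|\mathcal{X}| - \frac{1}{k} H(Z_{b,k}) + \frac{1}{k} h_b(\epsilon) \right]
\tag{47}
$$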
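As a numerical illustration of the bound (47), the following minimal sketch computes the entropy of the empirical distribution of the k-blocks of a sequence and the resulting right-hand side. The function names are hypothetical, entropies are measured in bits, and a trailing partial block is simply ignored; these are assumptions of the sketch rather than choices made in the paper.

```python
import math
from collections import Counter


def empirical_block_entropy(z, k):
    """Entropy (in bits) of the empirical distribution of the complete
    k-blocks of z, i.e. H(Z_{b,k}) with b = len(z) // k."""
    b = len(z) // k
    blocks = [tuple(z[i * k:(i + 1) * k]) for i in range(b)]
    counts = Counter(blocks)
    return -sum((c / b) * math.log2(c / b) for c in counts.values())


def rate_upper_bound(z, k, alphabet_size, eps=0.0):
    """Right-hand side of (47) for 0 <= eps < 1:
    (log|X| - H(Z_{b,k})/k + h_b(eps)/k) / (1 - eps)."""
    h = empirical_block_entropy(z, k)
    if eps == 0.0:
        hb = 0.0
    else:
        hb = -eps * math.log2(eps) - (1 - eps) * math.log2(1 - eps)
    return (math.log2(alphabet_size) - h / k + hb / k) / (1 - eps)
```

For the alternating binary sequence 0, 1, 0, 1, ..., blocks of size k = 2 are all identical, so the block entropy is zero and the bound equals 1 bit per channel use, whereas with k = 1 the symbol entropy is 1 bit and the bound drops to zero.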

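If $R$ is an achievable rate then by Definition 4, for any $\epsilon > 0$ there exists $k > 0$ such that (47) holds for this $k$ and $b$ large enough. Therefore taking $\liminf_{b \to \infty}$ on both sides yields:

$$
R \leq (1 - \epsilon)^{-1} \left[ \log|\mathcal{X}| - \frac{1}{k} \limsup_{b \to \infty} H(Z_{b,k}) + \frac{1}{k} h_b(\epsilon) \right]
\tag{48}
$$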

Next, let us relate $H(Z_{b,k})$ to the finite state compressibility. There exists a finite state machine $F$ with $S_k = |\mathcal{X}|^{k-1} \cdot k$ states that compresses the sequence $z_1^{bk}$ to at most $b \cdot (H(Z_{b,k}) + 1)$ bits. This state machine implements a block to variable encoder tuned to the empirical distribution and is structured as follows: its state space includes a counter from 1 to $k$ which counts the index inside the block, and a memory of $k - 1$ input characters. When the counter reaches $k$ the machine outputs an encoded string, and the counter returns to 1. In the other counter states the machine emits the empty string. The encoded string is generated by a simple block to variable encoder optimized to compress the random variable $Z_{b,k}$ to its minimum average length (e.g. a Huffman encoder, although a simple encoder using lengths $\lceil \log \frac{1}{\Pr(Z_{b,k})} \rceil$ is sufficient for this purpose), and therefore its average encoded length for $Z_{b,k}$ is at most $H(Z_{b,k}) + 1$ [22, Section 5.4]. The encoding length is therefore:

$$
\begin{aligned}
\sum_{i=1}^{b} L\left(F\left(z_{(i-1)k+1}^{ik}\right)\right)
&= \sum_{a \in \mathcal{X}^k} b \cdot \Pr\left(z_{(U-1)k+1}^{Uk} = a\right) \cdot L(F(a)) \\
&= b \cdot \sum_{a \in \mathcal{X}^k} \Pr(Z_{b,k} = a) \cdot L(F(a)) \\
&\leq b \left( H(Z_{b,k}) + 1 \right)
\end{aligned}
\tag{49}
$$



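Therefore for $n = bk$

$$
\rho_{F(S_k)}(z_1^n) \leq \rho_F(z_1^n) = \frac{1}{n \log|\mathcal{X}|} L(F(z_1^n)) \leq \frac{1}{n \log|\mathcal{X}|} b \left( H(Z_{b,k}) + 1 \right) = \frac{1}{k \log|\mathcal{X}|} \left( H(Z_{b,k}) + 1 \right)
\tag{50}
$$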



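The condition $n = bk$ may be relaxed and the inequality may be applied to any finite $n$, taking $b = \lfloor n/k \rfloor$ (since if the last block is unfinished it will not contribute to the length, and the normalization by $n \geq bk$ will only decrease the LHS). Hence

$$
\limsup_{n \to \infty} \rho_{F(S_k)}(z_1^n) \leq \limsup_{n \to \infty} \rho_F(z_1^n) \leq \frac{1}{k \log|\mathcal{X}|} \left( \limsup_{b \to \infty} H(Z_{b,k}) + 1 \right)
\tag{51}
$$

and

$$
\rho(z) = \lim_{s \to \infty} \limsup_{n \to \infty} \rho_{F(s)}(z_1^n) \leq \limsup_{n \to \infty} \rho_{F(S_k)}(z_1^n) \leq \frac{1}{k \log|\mathcal{X}|} \left( \limsup_{b \to \infty} H(Z_{b,k}) + 1 \right)
\tag{52}
$$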
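The block-to-variable encoder used in this argument can be checked numerically. The sketch below is an illustration under assumptions, not the finite state machine itself: it assigns each k-block the Shannon length ceil(log2(1/p)) under the empirical block distribution (the "simple encoder" mentioned in the text) and compares the total length with the bound b(H(Z_{b,k}) + 1) of (49). The function name is hypothetical.

```python
import math
from collections import Counter


def shannon_block_code_length(z, k):
    """Return (total_bits, bound) for coding the complete k-blocks of z with
    lengths ceil(log2(1/p)), p being the empirical block probability.
    bound = b * (H(Z_{b,k}) + 1), the right-hand side of (49)."""
    b = len(z) // k
    blocks = [tuple(z[i * k:(i + 1) * k]) for i in range(b)]
    counts = Counter(blocks)
    # These lengths satisfy the Kraft inequality, so a prefix code exists.
    lengths = {blk: math.ceil(-math.log2(c / b)) for blk, c in counts.items()}
    total_bits = sum(lengths[blk] for blk in blocks)
    entropy = -sum((c / b) * math.log2(c / b) for c in counts.values())
    return total_bits, b * (entropy + 1)
```

Dividing the total length by n log|X| gives the compression ratio that appears in (50).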



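Combining the above with (48) yields:

$$
\forall \epsilon \; \exists k: \quad
R \leq (1 - \epsilon)^{-1} \left[ \log|\mathcal{X}| - \frac{1}{k} \limsup_{b \to \infty} H(Z_{b,k}) + \frac{1}{k} h_b(\epsilon) \right]
\leq (1 - \epsilon)^{-1} \left[ \log|\mathcal{X}| - \log|\mathcal{X}| \rho(z) + \frac{1}{k} + \frac{1}{k} h_b(\epsilon) \right]
\tag{53}
$$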



Since the $k$ satisfying the requirements of Definition 4 may be small, the factor $\frac{1}{k}$ on the RHS makes the bound loose. To tighten the bound the following argument is used: choose a number $j > 0$. Suppose there exist $E, D$ with block size $k$ and average error probability $\epsilon$ over $b$ blocks, for $b$ large enough and divisible by $j$. Treat each $j$ consecutive blocks as a new block, forming an encoder and decoder with block size $j \cdot k$ by using the original encoder and decoder $j$ times. If $\epsilon_i$ denotes the error probability over block $i \in \{1, \ldots, b\}$, then by the union bound the error probabilities of the aggregate encoder and decoder satisfy $\epsilon'_i \leq \sum_{d=1}^{j} \epsilon_{(i-1)j+d}$, and therefore the average error probability is $\epsilon' = \frac{j}{b} \sum_{i=1}^{b/j} \epsilon'_i \leq \frac{j}{b} \sum_{i=1}^{b} \epsilon_i = j \epsilon$. The conclusion is that if the requirements of Definition 4 are met for a certain $\epsilon, k$, they are also met for $j \cdot \epsilon, j \cdot k$. Therefore:

$$
\forall j, \epsilon \; \exists k: \quad
R \leq (1 - j\epsilon)^{-1} \left[ (1 - \rho(z)) \log|\mathcal{X}| + \frac{1}{jk} \left( 1 + h_b(j\epsilon) \right) \right]
\tag{54}
$$

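Note that Definition 4 requires the rate to be achievable for any $\epsilon > 0$, and therefore it is possible to take $\epsilon \to 0$. Choosing for each $j$ the value $\epsilon = \frac{1}{j^2}$, denoting by $k_j$ any $k$ that satisfies (54) for this $j$, and taking the limit $j \to \infty$ yields:

$$
R \leq \lim_{j \to \infty} \left( 1 - \tfrac{1}{j} \right)^{-1} \left[ (1 - \rho(z)) \log|\mathcal{X}| + \frac{1}{j k_j} \left( 1 + h_b\left( \tfrac{1}{j} \right) \right) \right] = (1 - \rho(z)) \log|\mathcal{X}|
\tag{55}
$$

which by Definition 5 proves the theorem. □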



B. Proof of Theorem 2

To prove Theorem 2, it is first shown that LZ77 [14] and LZ78 [6] fulfil the requirements of Theorem 4. Both algorithms operate by creating a dictionary from previous symbols in the string, and compressing a new substring to a tuple containing its location in the dictionary plus, possibly, one additional symbol. In LZ77 the dictionary consists of all substrings that begin in a window of specified length before the first symbol that has not been encoded yet. LZ78 parses the string $z$ into phrases. Each phrase is a substring which is not a prefix of any previous phrase, but can be generated by concatenating a previous phrase with one additional symbol. The dictionary contains all phrases.
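The LZ78 incremental parsing described above can be sketched as follows. This is a minimal illustration of the parsing step only, not of the full encoder (indices and bit lengths are omitted); the function name `lz78_parse` is hypothetical. The function also returns the unterminated tail, i.e. a final substring that matches an existing phrase and therefore has not yet produced a tuple.

```python
def lz78_parse(s):
    """Incrementally parse the string s into LZ78 phrases.

    Each phrase is the shortest substring, starting where the previous
    phrase ended, that has not appeared as a phrase before. Returns the
    list of phrases and the unterminated tail (possibly empty)."""
    phrases = []
    seen = set()
    current = ""
    for symbol in s:
        current += symbol
        if current not in seen:
            # A previous phrase plus one new symbol: close the phrase.
            seen.add(current)
            phrases.append(current)
            current = ""
    return phrases, current


if __name__ == "__main__":
    print(lz78_parse("1011010100010"))
    print(lz78_parse("10110"))
```

The non-empty tail returned for the second input is exactly the incomplete last phrase whose treatment determines whether the terminated length is monotonic.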

It is easy to make sure that $L_T$ is monotonic (requirement (2) of Theorem 4). This depends on the way the last phrase in the string is treated (and does not affect the asymptotic performance), since this phrase may be an incomplete substring of a string in the dictionary, and therefore does not naturally terminate and produce a tuple. If, for example, the last phrase is sent without coding, then $L_T$ will not be monotonic (since adding more symbols to $z$ that terminate the phrase will result in a shorter compression). A simple treatment is to encode the last phrase similarly to other phrases: refer to one of the phrases in the dictionary which is a prefix of the remaining substring, and always give the length of the last substring (or the length of the block) at the end. This way the compression length associated with the last substring does not decrease when the substring is extended.

In order to bound $L_T(z) - L_S(z)$ (requirement (1)), it is required to bound the tuple which encodes the last phrase. In LZ78 this tuple carries an index to a previous phrase, plus a new symbol. The number of previous phrases is bounded by $n$ (a coarse bound, but sufficient for the current purpose), and therefore [22, Lemma 13.5.1] its encoding will be of length $\log n + \log\log n + 1$, and the length of the tuple will be $\log n + \log\log n + c$ (where $c$ is a constant accounting also for rounding, encoding of the additional symbol, etc.). Therefore, if the encoder ends the block with an indication of its length, then $\Delta_L^{\max}(n) = \Delta_L(n) \leq 2 \log n + 2 \log\log n + c$. In LZ77 this tuple carries a pointer to the window and a length (i.e. two numbers bounded to $\{1, \ldots, n\}$). Therefore, after adding an indication of the length at the termination, $\Delta_L^{\max}(n) = \Delta_L(n) \leq 3 \log n + 3 \log\log n + c$. In both cases $\Delta_L^{\max}(n) = O(\log n)$ and the requirement is satisfied. Therefore the compression length $L_{78}(z)$ may be substituted in Theorem 4.

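A result by Lempel and Ziv [6, Theorem 2 (item ii)] shows that for every finite $s$

$$
\rho_{78}(z_1^n) \triangleq \frac{1}{n \log|\mathcal{X}|} L_{78}(z_1^n) \leq \rho_{F(s)}(z_1^n) + \delta_s(n)
\tag{56}
$$

where $\delta_s(n) \to 0$ as $n \to \infty$. By Theorem 4, for any $\epsilon > 0$, the system attains the rate

$$
\begin{aligned}
R &\geq R_{emp} - \delta_n = \log|\mathcal{X}| - \frac{1}{n} L_{78}(z) - \delta_n \\
&= \left( 1 - \rho_{78}(z) \right) \log|\mathcal{X}| - \delta_n \\
&\geq \left( 1 - \rho_{F(s)}(z_1^n) - \delta_s(n) \right) \log|\mathcal{X}| - \delta_n
\end{aligned}
\tag{57}
$$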



Fix a value 5. Since lim„_j.oo (^n = it is possible to find 
n\ large enough so that for any n > nl, Sn < 6. For a 
specific value of s, because lini„_j.oo (n) = 0, it is possible 
to find rij large enough so that for any n > rv^, 5s{n) < 5. 
Because pp(^s){'^i) @ is monotonically non-decreasing with 
s, then for any s, p(z) = linis^oo limsup„^^ pi^(5)(z'/) > 
limsup„_j.Q^ p^(s)(z") (i.e. the limit on s converges from 
below). For the same s, find n > nl,n2 so that ^^(^^(z") < 
limsup„_^o^ Pi?(s)(z") + S < p{z) + 6 (note that due to the 
lim sup this would not, in general, hold for any larger n). 
Writing ([57| for these s,n yields: 



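$$
\begin{aligned}
R &\geq \left( 1 - \rho_{F(s)}(z_1^n) - \delta \right) \log|\mathcal{X}| - \delta \\
&\geq \left( 1 - \rho(z) - 2\delta \right) \log|\mathcal{X}| - \delta \\
&= \left( 1 - \rho(z) \right) \log|\mathcal{X}| - \left( 2 \log|\mathcal{X}| + 1 \right) \cdot \delta
\end{aligned}
\tag{58}
$$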

Therefore the requirements of Theorem 2 are satisfied by choosing the $\delta$ used above as $(2 \log|\mathcal{X}| + 1)^{-1}$ times the $\delta$ required by the theorem. □



Proof of Corollary 2.1: The corollary follows directly from the definition, by application of Theorem 2 and Theorem 1.

Proof of Corollary 2.2: Suppose the sequence $z$ is drawn by a stationary ergodic source. The mutual information rate is $\bar{I}(X; Y) = \bar{H}(Y) - \bar{H}(Y|X) \leq \log|\mathcal{X}| - \bar{H}(Z)$, and equality is obtained by a uniform i.i.d. prior, which maximizes $\bar{H}(Y)$. Hence the capacity is $C = \log|\mathcal{X}| - \bar{H}(Z)$. It was shown [6, Theorem 4] that the finite state compressibility equals the entropy rate of the source, with probability one. The proposed communication system would asymptotically attain the communication rate $C$, without prior knowledge of the noise distribution.
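As a simple special case of the capacity expression in Corollary 2.2, consider i.i.d. noise, for which the entropy rate equals the per-symbol entropy. The sketch below evaluates C = log|X| - H(Z) in bits for a given noise distribution; the function names are hypothetical and the restriction to i.i.d. noise is an assumption of the example, not of the corollary.

```python
import math


def entropy_bits(pmf):
    """Entropy in bits of a probability mass function (zero terms skipped)."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)


def modulo_additive_capacity(noise_pmf):
    """C = log2|X| - H(Z) for i.i.d. noise over an alphabet of size
    len(noise_pmf), assuming a uniform i.i.d. input distribution."""
    return math.log2(len(noise_pmf)) - entropy_bits(noise_pmf)
```

For a binary alphabet with noise probability p this reduces to the familiar 1 - h_b(p) of the binary symmetric channel.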

C. Attaining a rate related to the compression length of the noise

Here the result that the rate $\log|\mathcal{X}| - \frac{1}{n} L(z)$ can be attained for a wide class of source encoders is formalized and proved.

1) Attainability result: A class of sequential source encoders is first defined, for which Theorem 4 applies. For each sequence $z$ define $L_S(z)$ as the unterminated coding length of the sequence, i.e. the length of the output of the encoder after the input $z$ has been fed, but the sequence has not been terminated (i.e. the encoder is expecting additional input), and $L_T(z) = L(z)$ as the terminated coding length, i.e. the length of the output when $z$ is the complete sequence. The sequence $z$ is uniquely decodable from the $L_T(z)$ bits of the terminated code, but not necessarily from the $L_S(z)$ bits of the unterminated one. The difference $L_T(z) - L_S(z) \geq 0$ is the information stored in the encoder which has not been output yet. The class is defined by the two requirements:

1) The difference between the terminated and unterminated lengths is bounded by an asymptotically negligible value: $\frac{1}{n} \left( L_T(z) - L_S(z) \right) \leq \frac{1}{n} \Delta_L(n) \to 0$ as $n \to \infty$. This can be considered an embodiment of the limitation to "sequential" encoders and precludes encoders that process the entire sequence before producing outputs.

2) The encoding length does not decrease when the sequence is extended: $L_T(z_1^n) \geq L_T(z_1^{n-1})$.
Theorem 4. Given a sequential source coding scheme with input symbols from alphabet $\mathcal{X}$ that satisfies assumptions (1) and (2), and assigns a codeword length of $L(z)$ to the sequence $z \in \mathcal{X}^n$, then for any $\epsilon > 0$ there exists a sequence of adaptive-rate encoders and decoders using common randomness and feedback, for increasing block lengths $n$ over the channel $y = x + z$ ($x, y, z \in \mathcal{X}^n$), in which for any individual noise sequence $z$, with probability at least $1 - \epsilon$, the message is correctly decoded with rate of at least



where 



i?emp(z) -loglA"! - -L(z) 

n 



llog\X\ 



, n ■ \X\ , 
log ( — — ] + A2'^''(n) 



and AJ^^'<^(n) = max{AL(fe)}^^i. 



(59) 
(60) 



(61) 



2) The adaptive communication scheme: To achieve the claims of Theorem 4, a variant of the rate adaptive scheme used in a previous paper [5] is used. This scheme applies repeated "rateless" transmissions: fix a value $K$ of the number of bits per block. Generate a random codebook of $\exp(K)$ words chosen independently and distributed uniformly over $\mathcal{X}^n$, which is known at the encoder and decoder (and comprises the common randomness). In each rateless block $b = 1, 2, \ldots$, the encoder sends $K$ bits to the decoder, by sending the respective symbols from the codeword indexed by those $K$ bits. Note that at each block different symbols from the codebook are sent. The block terminates when a termination condition is satisfied at the decoder. Then, the decoder stores the decoded bits and indicates this to the encoder through the feedback link (a 0-1 feedback is sufficient), and a new block of $K$ bits begins. The last block is potentially not decoded, if the termination condition is not satisfied at the last symbol.

The decoding and termination rule are specified next. Suppose that the current symbol number is $i$ and the block number is $b$. The last symbol of the previous block (number $b - 1$) was sent at symbol $j$ ($j = 0$ if $b$ is the first block). Let $\hat{x}_1^j$ denote the transmit sequence that follows from the previous decisions made by the decoder (i.e. it is composed of the symbols from the codebook matching the decoded bits at each previously decoded block), and let $x_{j+1}^i(m)$ denote the transmitted symbols matching codeword $m$ ($m = 1, \ldots, \exp(K)$). $\hat{z}_1^i(m)$ defined below is the decoder's hypothesis on the noise sequence $z_1^i$:

$$
\hat{z}_1^i(m) = y_1^i - \left( \hat{x}_1^j, x_{j+1}^i(m) \right)
\tag{62}
$$

Take $\hat{z}_1^j = y_1^j - \hat{x}_1^j$ to be the $j$-length prefix of $\hat{z}_1^i(m)$ (which is independent of $m$). The decoder calculates the following condition for all $m = 1, \ldots, \exp(K)$:

$$
L_T\left( \hat{z}_1^i(m) \right) - L_S\left( \hat{z}_1^j \right) \leq \left\lfloor (i - j) \cdot \log|\mathcal{X}| - \log\frac{n}{\epsilon} - K \right\rfloor
\tag{63}
$$

It announces the end of the block and decodes the bits matching codeword index $m$ if the termination condition is satisfied with respect to codeword $m$ (where ties can be broken arbitrarily), and does not terminate the block if the condition fails for all codewords.
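The termination test (63) can be sketched as a small helper. This is a minimal illustration under assumptions: lengths are in bits (logarithms to base 2), the coding lengths L_T and L_S are supplied by the caller as plain integers rather than computed by an actual sequential encoder, and the function names are hypothetical.

```python
import math


def termination_threshold(i, j, K, n, eps, alphabet_size):
    """Right-hand side of (63): floor((i - j) log|X| - log(n / eps) - K)."""
    return math.floor(
        (i - j) * math.log2(alphabet_size) - math.log2(n / eps) - K
    )


def block_terminates(lt_hypothesis, ls_prefix, i, j, K, n, eps, alphabet_size):
    """True if codeword hypothesis m satisfies (63), where
    lt_hypothesis = L_T of the hypothesized noise sequence up to symbol i and
    ls_prefix = L_S of the already decided prefix up to symbol j."""
    threshold = termination_threshold(i, j, K, n, eps, alphabet_size)
    return lt_hypothesis - ls_prefix <= threshold
```

For example, with a binary alphabet, n = 1024, eps = 2^-6, K = 8 and 40 symbols since the previous block, the threshold is 40 - 16 - 8 = 16 bits, so a hypothesis terminates the block only if the hypothesized noise over those 40 symbols costs at most 16 additional bits to describe.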

Regarding the termination condition (63), note that the RHS starts from a negative value and increases linearly at a rate of $\log|\mathcal{X}|$ bits per symbol, while the LHS starts from a non-negative value, but for a compressible noise sequence it is expected to increase at a rate slower than $\log|\mathcal{X}|$ bits per symbol. Therefore, if the noise sequence is compressible and the block length $n$ is large enough, the condition will eventually be met. The scheme suggested above differs from the previously proposed scheme [5] mainly in the fact that the termination and decoding condition involves the entire past, rather than just the symbols in the current block.

3) Proof of Theorem 4: In order to prove Theorem 4 it is shown that the scheme above achieves an error probability of at most $\epsilon$, and if an error does not occur, the number of bits decoded (determined by the number of blocks sent) approaches $R_{emp}$ for a suitable choice of $K$.

Let us begin by bounding the error probability. First let us calculate the probability that the decoder decides in favor of an incorrect codeword at any given symbol $i$ (where again $j$ denotes the end of the previous block), by using a property of the sequential encoder. Consider a sequence $z^i$ of length $i$ which is fed into the sequential source encoder in two stages: first, the first $j$ symbols are fed (and the encoder has emitted $L_S(z^j)$ bits), and then the remaining $i - j$ symbols are fed and the encoding is terminated. Between the $j$-th and the $i$-th symbol, the encoder has emitted $L_T(z^i) - L_S(z^j)$ additional bits, which can be used to uniquely decode $z_{j+1}^i$ when $z^j$ is given (since the entire encoded stream can be generated from the first $L_S(z^j)$ bits plus these additional bits, and used to decode $z^i$). Therefore the number of sequences $z_{j+1}^i$ for which $L_T(z^i) - L_S(z^j) \leq d$ (where $d \in \mathbb{N}$) is upper bounded by $\exp(d)$ (since they are in effect encoded by $d$ bits).

Since the codewords are independent, given the transmitted 
symbols, the other codewords in the codebook over the period 
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of the current block are independent sequences uniformly 
drawn from X^^K Therefore the hypothesized tail of the 
sequence Z*_(_j^(m) = ^j+i ~ -^j+i(™) for ^i^Y fixed m 
is also uniformly distributed (over the common randomness). 
Since there are at most exp{d) sequences that satisfy Lxiz^) — 
Lsi^-') < d, the probability that a particular sequence will 
satisfy the condition is at most 



exp(d) 



(64) 



and therefore by the union bound, the probability that any of 
the competing sequences will satisfy the condition is at most 



exp(d) exp(i^) 



exp(ci + i^-(i-j)log|A'| 



(65) 



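Substituting the value of $d$ given by the termination condition,
$d = \lfloor (i-j) \cdot \log |\mathcal{X}| - \log \frac{n}{\epsilon} - K \rfloor \le (i-j) \cdot \log |\mathcal{X}| - \log \frac{n}{\epsilon} - K$,
the error probability per symbol is at most $\exp(-\log \frac{n}{\epsilon}) = \frac{\epsilon}{n}$;
therefore by the union bound over $n$ symbols, the probability
of any error occurring during the decoding process is at most
$\frac{\epsilon}{n} \cdot n = \epsilon$.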
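
As a quick numerical sanity check of the error bound, the following minimal Python sketch evaluates the union-bound expression of (65) with $d$ set by the termination condition. It assumes that `exp` and `log` are taken in base 2 (bits); the function name and the parameter values are illustrative choices and are not part of the scheme's definition.

```python
import math

def per_symbol_error_bound(block_len, K, n, eps, alphabet_size):
    """Value of exp(d + K - (i-j) log|X|) from (65), base-2 logs assumed.

    block_len plays the role of (i - j); d is set by the termination
    condition d = floor((i-j) log|X| - log(n/eps) - K).
    """
    log_x = math.log2(alphabet_size)
    d = math.floor(block_len * log_x - math.log2(n / eps) - K)
    return 2.0 ** (d + K - block_len * log_x)

# Illustrative parameters (assumptions, not taken from the paper).
n, eps, K, alphabet_size = 1000, 0.01, 20, 2

# Worst case over all possible block lengths, then union bound over n symbols.
worst = max(per_symbol_error_bound(t, K, n, eps, alphabet_size)
            for t in range(1, n + 1))
total = n * worst
print(worst, eps / n, total)
```

Because the floor only decreases $d$, the per-symbol value never exceeds $\epsilon / n$, and the total over $n$ symbols stays below $\epsilon$ for any choice of the parameters.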

Next let us analyze the rate achieved by the scheme. The
analysis assumes no decoding errors occur. Denote the number
of decoded blocks by $B$ (so potentially there are $B + 1$
blocks, if the last block is not decoded). The proof is based on
bounding the value of $L_T(\mathbf{z})$ in terms of the number of blocks, where $\mathbf{z}$
denotes the true noise sequence.

Suppose a block was decoded in symbol $i$ and the previous
block ended at symbol $j$. By choosing $K$ (or $n$) large enough
it can be guaranteed that decoding never happens at the first
symbol of any block, therefore $i > j + 1$. By the assumption
that no decoding errors occurred, the noise sequence reconstructed
by the decoder up to symbol $j$ is identical to $\mathbf{z}^j$. In symbol $i - 1$
the decoding condition was not met for any
codeword, including the correct one, for which $\mathbf{z}^{i-1}(m) = \mathbf{z}^{i-1}$.
Therefore it holds, with respect to the true noise sequence,
that:

$$L_T(\mathbf{z}^{i-1}) - L_S(\mathbf{z}^j) > (i - 1 - j) \log |\mathcal{X}| - \log \frac{n}{\epsilon} - K. \tag{66}$$

This is an inverted version of condition (63). Note that the floor
operator $\lfloor \cdot \rfloor$ is not needed here since the LHS is an integer.

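Using monotonicity of $L_T$ and the bounded difference $L_T - L_S$,
the following telescopic series is lower bounded:

$$\begin{aligned}
L_T(\mathbf{z}^i) - L_T(\mathbf{z}^j) &= L_T(\mathbf{z}^i) - L_S(\mathbf{z}^j) - [L_T(\mathbf{z}^j) - L_S(\mathbf{z}^j)] \\
&\ge L_T(\mathbf{z}^i) - L_S(\mathbf{z}^j) - \Delta_L^{\max}(n) \\
&\ge L_T(\mathbf{z}^{i-1}) - L_S(\mathbf{z}^j) - \Delta_L^{\max}(n) \\
&> (i - 1 - j) \log |\mathcal{X}| - \log \frac{n}{\epsilon} - K - \Delta_L^{\max}(n),
\end{aligned} \tag{67}$$

where $\Delta_L^{\max}(n) = \max\{\Delta_L(i)\}_{i=1}^{n}$. By the same argument,
this bound is true also for the undecoded block (with $i - 1 = n$).
Taking $j_b$ ($b = 1, \ldots, B$) to be the symbol in which block $b$
ended, and adding $j_0 = 0$ and $j_{B+1} = n$, the following bound
is obtained by summing (67) over $B + 1$ blocks (including the
undecoded one, which is taken as a block of length 0 if the
last block is decoded):

$$\begin{aligned}
L_T(\mathbf{z}) &= L_T(\mathbf{z}^{j_{B+1}}) - L_T(\mathbf{z}^{j_0}) \\
&= \sum_{b=1}^{B+1} \left[ L_T(\mathbf{z}^{j_b}) - L_T(\mathbf{z}^{j_{b-1}}) \right] \\
&> \sum_{b=1}^{B+1} \left( (j_b - 1 - j_{b-1}) \log |\mathcal{X}| - \log \frac{n}{\epsilon} - K - \Delta_L^{\max}(n) \right) \\
&= n \log |\mathcal{X}| - (B + 1) \left[ K + \log |\mathcal{X}| + \log \frac{n}{\epsilon} + \Delta_L^{\max}(n) \right].
\end{aligned} \tag{68}$$

The actual rate achieved by the scheme is

$$R_{\mathrm{act}} = \frac{BK}{n}. \tag{69}$$

Extracting $B$ from (68) and calculating $R_{\mathrm{act}}$ yields:

$$\begin{aligned}
R_{\mathrm{act}} = \frac{BK}{n} &> \frac{K}{n} \left[ \frac{n \log |\mathcal{X}| - L_T(\mathbf{z})}{K + \log |\mathcal{X}| + \log \frac{n}{\epsilon} + \Delta_L^{\max}(n)} - 1 \right] \\
&= \left( 1 + \frac{\log(|\mathcal{X}| n / \epsilon) + \Delta_L^{\max}(n)}{K} \right)^{-1} R_{\mathrm{emp}}(\mathbf{z}) - \frac{K}{n} \\
&\overset{(a)}{\ge} \left( 1 - \frac{\log(|\mathcal{X}| n / \epsilon) + \Delta_L^{\max}(n)}{K} \right) R_{\mathrm{emp}}(\mathbf{z}) - \frac{K}{n} \\
&\overset{(b)}{\ge} R_{\mathrm{emp}}(\mathbf{z}) - \frac{\log |\mathcal{X}| \cdot \left( \log(|\mathcal{X}| n / \epsilon) + \Delta_L^{\max}(n) \right)}{K} - \frac{K}{n},
\end{aligned} \tag{70}$$

where (a) is because $\forall x > 0 : (1 + x)^{-1} \ge 1 - x$, and (b)
is because $R_{\mathrm{emp}}(\mathbf{z}) \le \log |\mathcal{X}|$. To choose the value of $K$
that approximately minimizes the overhead term in the lower
bound, the following lemma is used:

Lemma 2. For $a > 0$, $b > 0$ with $b \le a$:

$$r = \min_{k \in \mathbb{N}} \left( \frac{a}{k} + bk \right) \le 3 \sqrt{ab}. \tag{71}$$

Proof: It is easy to see by differentiation that the minimizer over
$x \in \mathbb{R}$ of $\frac{a}{x} + bx$ is $x^* = \sqrt{a/b}$. Choosing $k^* = \lceil x^* \rceil$ yields
$k^* \in \mathbb{N}$ and, since $x^* \le k^* < x^* + 1$:

$$\frac{a}{k^*} + b k^* < \frac{a}{\sqrt{a/b}} + b \left( \sqrt{\frac{a}{b}} + 1 \right) = 2 \sqrt{ab} + b = 2 \sqrt{ab} + \sqrt{b \cdot b} \overset{b \le a}{\le} 3 \sqrt{ab}. \tag{72}$$

□

Applying the lemma to the choice of $K$ in (70) yields:

$$R_{\mathrm{act}} \ge R_{\mathrm{emp}}(\mathbf{z}) - 3 \sqrt{ \frac{\log |\mathcal{X}|}{n} \left( \log(n) + \log \frac{|\mathcal{X}|}{\epsilon} + \Delta_L^{\max}(n) \right) } = R_{\mathrm{emp}}(\mathbf{z}) - \delta_n, \tag{73}$$

where by assumption (1), $\delta_n \to 0$.

□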
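
The role of Lemma 2 in choosing $K$ can be illustrated numerically. The following minimal Python sketch takes $a = \log |\mathcal{X}| \cdot (\log(|\mathcal{X}| n / \epsilon) + \Delta_L^{\max}(n))$ and $b = 1/n$, as suggested by the overhead term of (70), and compares the choice $k^* = \lceil \sqrt{a/b} \rceil$ from the proof with an exhaustive search. The base-2 logarithms, the value $\Delta_L^{\max}(n) = 0$, the search range and all function names are assumptions made only for this illustration.

```python
import math

def overhead(a, b, k):
    """Overhead term a/k + b*k from Lemma 2."""
    return a / k + b * k

def lemma2_choice(a, b):
    """k* = ceil(sqrt(a/b)), the choice made in the proof (needs 0 < b <= a)."""
    return math.ceil(math.sqrt(a / b))

def brute_force_min(a, b):
    """Exhaustive minimum over k = 1..2*k*.

    The search range is an assumption; it suffices here because
    a/k + b*k is convex in k with its real minimizer at sqrt(a/b).
    """
    k_star = lemma2_choice(a, b)
    return min(overhead(a, b, k) for k in range(1, 2 * k_star + 1))

# Illustrative parameters (assumptions): binary alphabet, base-2 logs,
# and Delta_L^max(n) set to 0 for simplicity.
n, eps, alphabet_size, delta_max = 10**6, 1e-3, 2, 0.0
a = math.log2(alphabet_size) * (math.log2(alphabet_size * n / eps) + delta_max)
b = 1.0 / n
K = lemma2_choice(a, b)
print(K, overhead(a, b, K), brute_force_min(a, b), 3 * math.sqrt(a * b))
```

With these values the overhead at $k^*$ is close to the exhaustive minimum, and both lie below $3\sqrt{ab}$, in line with (71).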
D. Proof of Corollary 3.1

The target is to find the required n such that A* < ^ log \X\, 
based on the bounds of Theorem [3] According to the lower 
bound ([9), either r < ^, or [log (fcr) '-^^ < 

5 log \ X\, which combined with [x] > x — 1 yields, after rear- 
rangement, log (fcr) < {2kS + 1) log i.e. r < 
This condition on $\tau$ is always less strict than the former, and
because at least one of the conditions should hold, the second
always holds. Translating to a condition on $n$ yields:

T^^-k\Xr-''^' (74) 



> 



On the other hand, let us find an n for which the upper bound is 
at most i51og \X\. Define g(T) = r log (^). Assuming r < 
then g{T) is monotonically increasing, g(T) > Tlog(2A:), and 



fr^ < |r. Thus: 



log 



loge 



^ (^g +lj rloge 

9 9{r) 



<bir) + 



log(2fc) 
9 loge 



loge 



(75) 



5(r) 



81og(2fc), 
< 3.9(t). 

The same assumption t < ^ leads to n > k\X\*' and thus 
k 



< X 



(76) 



and 




loglA"! •log(fc2|A'|2fc+i) 



klXl'' 



5{^og\X\y 



(77) 



xif^ 

< lOloglA-l • \X\~''^^, 

where (a) is because logfc < (fc — l)loge, so fc^ < 
logg2(fc^i) < \x\3{k-i). Combining (|75]l, ^ with (fTO]) 
yields: 

X\~''\og{e\X\) + 10\og\X\-\X\-''/^ 
, log(e) 



and equivalently 

n — > 



\X[ 



[.g-i(i(<5-12.|A-|-fe/2).log|A'|),l/fc]- 

(81) 
□ 



E. Proof of Lemma 7

For the sake of brevity, as long as a single value of $k$
is discussed, let $\mathcal{M} = \mathcal{X}^k$ denote the super-alphabet of
length $k$ and $m = |\mathcal{X}|^k$ denote its size. Let $\pi(\cdot)$ define a
distribution over $\mathcal{M}$. The Dirichlet$(\frac{1}{2}, \ldots, \frac{1}{2})$ density over the
set of distributions is defined as:

$$w_k(\pi) = \exp(-C_m) \prod_{a \in \mathcal{M}} \pi(a)^{-1/2}, \tag{82}$$

where

$$C_m = \log\left( \Gamma(1/2)^m / \Gamma(m/2) \right), \tag{83}$$

and for an $l$-length vector $\mathbf{a} \in \mathcal{M}^l$, let $\pi(\mathbf{a}) = \prod_{i=1}^{l} \pi(a_i)$ be
the probability given to $\mathbf{a}$ by the i.i.d. distribution $\pi(\cdot)$. Let

$$P_k(\mathbf{a}) = \int \pi(\mathbf{a}) w_k(\pi) \, d\pi \tag{84}$$

define the weighted average of all probabilities given to $\mathbf{a}$ by
i.i.d. distributions $\pi(\mathbf{a})$, where the integral is over the unit
simplex $\Delta_m = \{\pi : \forall a \in \mathcal{M} : \pi(a) \ge 0, \sum_{a \in \mathcal{M}} \pi(a) = 1\}$.
By well known results of Shtarkov, which are detailed in
Lemma 1 and Xie and Barron's paper [23], it holds that:

log - ■ ^ ' < 



Pfe(a) 



log 

2 ^2tt 



-c„ 



m 



^ ) log e 



rik- 



(85) 



Note that the terms that do not scale with $n$ are usually
ignored, because $m$ is considered fixed; however, here they
matter, because the question would be how fast $m$ (equivalently
$k$) may grow with $n$. Thus for any $\pi$:

$$\pi(\mathbf{a}) \le P_k(\mathbf{a}) \exp(r_{l,k}). \tag{86}$$



A 



< 3g(r) 
= 3g(r) 



The same equality would hold when marginalizing the above
to any parts of $\mathbf{a}$ (i.e. summing over the remaining elements
of $\mathbf{a}$). Using this observation, let us set $l = \lceil n/k \rceil$. Then
substituting in (84), $\mathbf{a} = \mathbf{z}^{lk}$ yields



(87) 



< Mr) 



|;^|-fe/2 

2-1/2 



1 



log \X 
log(e) 



$$P_k(\mathbf{z}^{lk}) = \int_{\Delta_m} \pi_k(\mathbf{z}^{lk}) w_k(\pi) \, d\pi,$$

10 ) log \X\ ■ summing both sides with respect to $\mathbf{z}_{n+1}^{lk}$, yields



log 2 

<3g(r) + 121og|A'|.|A'|-'=/2. 



10 ) logl-Yl • 



(78) 



Pfc(z") 



(88) 



(89) 



Thus, to guarantee A* < S ■ log \X\ is it enough if r < t; and 



r<g-' [lis- 12 ■\X\-''/')-log\X\ 
i.e. it is required that 



T < min 



'(US-12-\X\-'^^^)-log\X\).l/k 



(79) 



(80) 



7rfc(z")wfc(7r)d7r. 

J Am 

Furthermore, by (86):

$$\forall \pi : \pi_k(\mathbf{z}^n) \le P_k(\mathbf{z}^n) \cdot \exp(r_{l,k}).$$



Let us now bound $r_{l,k}$. Following Xie and Barron's
[23, Remark 7], using $\Gamma(1/2) = \sqrt{\pi}$ and Stirling's approximation
$\Gamma(m/2) \ge \sqrt{2\pi} \, (m/2)^{\frac{m-1}{2}} e^{-\frac{m}{2}}$ yields from (83):

$$C_m \le \frac{m-1}{2} \log \pi - \frac{1}{2} \log 2 - \frac{m-1}{2} \log(m/2) + \frac{m}{2} \log e \tag{90}$$
and from (85):



m - 1 , / / 
rik < log — 

2 V 771 



4Z 



TTi I loge- ilog2. (91) 



Note that $r_{l,k}$ is always positive, even when $l < m$. When
$l < m$, the second term dominates, and the normalized loss
does not tend to zero. Therefore it is not useful to consider
$m$ in this region. Assuming $l \ge m$ (note that since $m \ge 2$
this also implies $n > k$), and substituting $l = \lceil n/k \rceil \le n$,
$m = |\mathcal{X}|^k$, yields



< 



\x\ 



2 °V \X[ 
4log2 

log 



471 



lA-l^ 1 lea 



\X\ 



k\X 



2k 



An 



A- r lege 



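(92)

Now let

$$P_Z(\mathbf{z}) = \sum_{k=1}^{\infty} 2^{-k} \cdot P_k(\mathbf{z}) \tag{93}$$

then from (89)

$$\forall \pi : \pi_k(\mathbf{z}^n) \le P_k(\mathbf{z}^n) \cdot \exp(r_{l,k}) \le 2^k P_Z(\mathbf{z}) \cdot \exp(r_{l,k}) \tag{94}$$

and thus

$$\forall \pi : \frac{1}{n} \log \pi_k(\mathbf{z}^n) \le \frac{1}{n} \log P_Z(\mathbf{z}) + \frac{1}{n} \left( k \log(2) + r_{l,k} \right). \tag{95}$$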

The factor $\frac{1}{n} (k \log(2) + r_{l,k})$ can be coarsely bounded by (92):

fclog(2)+nfe ^ 



277 ^°H,|<^I^- 



47l2 



loge 

77 77 / (96) 



A I A" I 

with r = < 1. Combining this bound with (95) yields
the result of the Lemma. □
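
The Dirichlet$(\frac{1}{2}, \ldots, \frac{1}{2})$ mixture used in this proof has a well known closed form in terms of symbol counts, and the same value can be accumulated sequentially with the add-1/2 (Krichevsky-Trofimov) rule. The following minimal Python sketch computes both and the resulting loss relative to the best i.i.d. distribution. It is only an illustration: the function names are hypothetical, logarithms are in base 2, and the grouping into super-symbols drops an incomplete final block instead of marginalizing over it as the proof does.

```python
import math
from collections import Counter

def log2_mixture_closed_form(symbols, m):
    """log2 of the Dirichlet(1/2,...,1/2) mixture probability of `symbols`
    over an alphabet of size m, using
    Gamma(m/2) / Gamma(l + m/2) * prod_a Gamma(n_a + 1/2) / Gamma(1/2)."""
    counts = Counter(symbols)
    l = len(symbols)
    ln_p = math.lgamma(m / 2) - math.lgamma(l + m / 2)
    ln_p += sum(math.lgamma(c + 0.5) - math.lgamma(0.5) for c in counts.values())
    return ln_p / math.log(2)

def log2_mixture_sequential(symbols, m):
    """Same quantity, accumulated with the add-1/2 sequential rule."""
    counts = Counter()
    log_p = 0.0
    for t, s in enumerate(symbols):
        log_p += math.log2((counts[s] + 0.5) / (t + m / 2))
        counts[s] += 1
    return log_p

def log2_max_likelihood(symbols):
    """log2 probability under the best i.i.d. distribution (empirical frequencies)."""
    counts = Counter(symbols)
    l = len(symbols)
    return sum(c * math.log2(c / l) for c in counts.values())

def to_super_symbols(z, k):
    """Group z into non-overlapping k-tuples; an incomplete tail is dropped
    (a simplification relative to the marginalization used in the proof)."""
    return [tuple(z[i:i + k]) for i in range(0, len(z) - k + 1, k)]

z = [0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1]   # illustrative binary sequence
k, alphabet_size = 2, 2
a = to_super_symbols(z, k)
m = alphabet_size ** k
loss_bits = log2_max_likelihood(a) - log2_mixture_closed_form(a, m)
print(loss_bits)
```

The difference `loss_bits` is the quantity that $r_{l,k}$ upper bounds in (86), for this particular sequence and block length.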

F. Password channel for i.i.d. distributions 



As noted in Section IV-E, even limiting the reference class
to i.i.d. input distributions would not solve the "password"
problem, and therefore universality is not possible even with
respect to such encoders, for general channels. To see this,
consider the following example, where the channel identifies
the input distribution of the encoder. This is a variation of the
"password channel" (Example 3).

Example 5. The channel class is a class of binary input-output
channels, parameterized by a single parameter $p \in [0, 1]$. For
each value of $p$, the channel is as follows:

• At each symbol $k$ in time, if the normalized number of
ones at the input is not within a range of thresholds
$[L_{k,p}, H_{k,p}]$, then from this time on, the channel "locks"
and the output is $y_k = 0$. Otherwise, the channel is noise
free and the output equals the input, $y_k = x_k$.
• The threshold sequences $L_{k,p}, H_{k,p}$ are computed such
that if the input is i.i.d. Ber($p$), then with high probability
$1 - \epsilon_0$, the thresholds will not be crossed during any of
the $n$ symbols (i.e. the channel will not lock). Clearly, as
$k$ increases, the thresholds will converge to $p$.

Thus, the channel "identifies" a certain input probability.
Notice that all the channels are causal and deterministic, and
they allow communication at a rate of approximately $h_b(p)$.
The "memoryless" reference schemes mentioned above can
communicate over this channel using a Ber($p$) input distribution
and approach this rate, with a small error probability. But
universal communication over the class is impossible. Until
the channel locks, nothing can be inferred about $p$
from the channel output. Therefore the transmit distribution
of the universal scheme until the lock time is independent of
$p$. On the other hand, any given input sequence will "lock"
some of the channels in the class. Therefore any operation of
the universal system is bound to cause some of the channels
to lock, and achieve an asymptotically zero rate.
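
The locking behaviour of Example 5 can be sketched in a few lines of Python. The example does not specify how the thresholds are computed; the sketch below assumes, purely for illustration, a Hoeffding-style confidence band around $p$ whose width is chosen so that a union bound over the $n$ symbols keeps the lock probability under i.i.d. Ber($p$) input below $\epsilon_0$. The function names and parameter values are hypothetical.

```python
import math

def hoeffding_thresholds(k, p, n, eps0):
    """Assumed thresholds [L, H] at time k (1-based): a Hoeffding band
    around p of half-width sqrt(ln(2n/eps0) / (2k)), clipped to [0, 1]."""
    width = math.sqrt(math.log(2 * n / eps0) / (2 * k))
    return max(0.0, p - width), min(1.0, p + width)

def password_channel(x, p, eps0=0.01):
    """Pass x through the channel with parameter p.

    The output equals the input until the running fraction of ones leaves
    the threshold band; from then on the channel is locked and outputs 0.
    Returns the output sequence and whether the channel locked.
    """
    n = len(x)
    ones = 0
    locked = False
    y = []
    for k, bit in enumerate(x, start=1):
        ones += bit
        if not locked:
            low, high = hoeffding_thresholds(k, p, n, eps0)
            if not (low <= ones / k <= high):
                locked = True
        y.append(0 if locked else bit)
    return y, locked

# The same input sequence is passed through two channels of the class.
x = [1] * 2000
y_matched, locked_matched = password_channel(x, p=1.0)
y_other, locked_other = password_channel(x, p=0.2)
print(locked_matched, locked_other, sum(y_other))
```

The all-ones input passes unchanged through the channel with $p = 1$ but locks the channel with $p = 0.2$ after a few symbols, which illustrates why a single input sequence cannot suit every channel in the class.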